Tidy-Data

Tidy-Data#

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data follows the rules:

Each variable is a column.
Each observation is a row.
Each type of observation has its separate table / DataFrame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.

Let’s import a prepared table with measurements and have a closer look. Is this table tidy?

import pandas as pd

df = pd.read_csv('data/multi-index-measurements.csv', index_col=[0,1], parse_dates=[1], header=[0,1])
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6 entries, ('dev_1', Timestamp('2023-01-01 07:00:00')) to ('dev_2', Timestamp('2023-01-01 09:00:00'))
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   (before, value1)  6 non-null      float64
 1   (before, value2)  6 non-null      float64
 2   (before, value3)  6 non-null      float64
 3   (after, value1)   6 non-null      float64
 4   (after, value2)   6 non-null      float64
 5   (after, value3)   6 non-null      float64
dtypes: float64(6)
memory usage: 588.0+ bytes

df

		before			after
	device	value1	value2	value3	value1	value2	value3
dev_1	2023-01-01 07:00:00	9.0	2.0	3.1	9.2	2.1	3.15
	2023-01-01 08:00:00	10.1	2.5	3.0	10.3	2.6	3.05
	2023-01-01 09:00:00	9.8	3.2	3.2	10.0	3.3	3.25
dev_2	2023-01-01 07:00:00	9.3	2.2	3.1	9.5	2.3	3.15
	2023-01-01 08:00:00	10.4	2.7	3.0	10.6	2.8	3.05
	2023-01-01 09:00:00	10.1	3.4	3.2	10.3	3.5	3.25

A useful method for tidying a DataFrame is Pandas’ melt(). It reformats the DataFrame from wide format to long format, incorporating the columns’ index labels into column values.

df.melt().head()

	None	device	value
0	before	value1	9.0
1	before	value1	10.1
2	before	value1	9.8
3	before	value1	9.3
4	before	value1	10.4

By providing additional parameters, we can specify names for the to-be-created columns for the values. Before this, we need to flatten the multi-index for the columns and put the index back as regular columns.

# Work on a copy to keep the original intact
tidy = df.copy()
# First flatten the MultiIndex columns -> 'before_value1'
flat_cols = []
for c in tidy.columns:
    if isinstance(c, tuple):
        if c[0] in ("", None):
            flat_cols.append(c[1])
        else:
            flat_cols.append(f"{c[0]}_{c[1]}")
    else:
        flat_cols.append(c)

tidy.columns = flat_cols
tidy.head()

		before_value1	before_value2	before_value3	after_value1	after_value2	after_value3
dev_1	2023-01-01 07:00:00	9.0	2.0	3.1	9.2	2.1	3.15
	2023-01-01 08:00:00	10.1	2.5	3.0	10.3	2.6	3.05
	2023-01-01 09:00:00	9.8	3.2	3.2	10.0	3.3	3.25
dev_2	2023-01-01 07:00:00	9.3	2.2	3.1	9.5	2.3	3.15
dev_2	2023-01-01 08:00:00	10.4	2.7	3.0	10.6	2.8	3.05

# Then bring the index columns "device","timestamp"" back as regular columns
tidy.index.set_names(["device","timestamp"], inplace=True)
tidy.reset_index(inplace=True)
tidy.head()

	device	timestamp	before_value1	before_value2	before_value3	after_value1	after_value2	after_value3
0	dev_1	2023-01-01 07:00:00	9.0	2.0	3.1	9.2	2.1	3.15
1	dev_1	2023-01-01 08:00:00	10.1	2.5	3.0	10.3	2.6	3.05
2	dev_1	2023-01-01 09:00:00	9.8	3.2	3.2	10.0	3.3	3.25
3	dev_2	2023-01-01 07:00:00	9.3	2.2	3.1	9.5	2.3	3.15
4	dev_2	2023-01-01 08:00:00	10.4	2.7	3.0	10.6	2.8	3.05

# Then melt the DataFrame
tidy = tidy.melt(
    id_vars=["device", "timestamp"],
    var_name="phase_variable", # var_name only allows scalar names
    value_name="value"
).assign(
    phase=lambda x: x["phase_variable"].str.split("_", expand=True)[0], # split phase_variable into two new columns
    variable=lambda x: x["phase_variable"].str.split("_", expand=True)[1] # split phase_variable into two new columns
).drop(
    columns=["phase_variable"] # drop the now redundant column
)
# Reorder and sort columns
tidy = tidy[["device", "timestamp", "phase", "variable", "value"]].sort_values(["device", "timestamp", "phase", "variable"])

tidy.head(10)

	device	timestamp	phase	variable	value
18	dev_1	2023-01-01 07:00:00	after	value1	9.20
24	dev_1	2023-01-01 07:00:00	after	value2	2.10
30	dev_1	2023-01-01 07:00:00	after	value3	3.15
0	dev_1	2023-01-01 07:00:00	before	value1	9.00
6	dev_1	2023-01-01 07:00:00	before	value2	2.00
12	dev_1	2023-01-01 07:00:00	before	value3	3.10
19	dev_1	2023-01-01 08:00:00	after	value1	10.30
25	dev_1	2023-01-01 08:00:00	after	value2	2.60
31	dev_1	2023-01-01 08:00:00	after	value3	3.05
1	dev_1	2023-01-01 08:00:00	before	value1	10.10

A DataFrame formatted this way is easier to manipulate in analysis, because now we can more easily mask by columns.

# Here we select the values for "dev_1", "before" and "value3".
tidy[
    (tidy["device"] == "dev_1") &
    (tidy["phase"] == "before") &
    (tidy["variable"] == "value3")
]

	device	timestamp	phase	variable	value
12	dev_1	2023-01-01 07:00:00	before	value3	3.1
13	dev_1	2023-01-01 08:00:00	before	value3	3.0
14	dev_1	2023-01-01 09:00:00	before	value3	3.2