Tidy-Data

Tidy-Data#

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data follows the rules:

Each variable is a column.
Each observation is a row.
Each type of observation has its own separate table / DataFrame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.

Let’s import a prepared table with measurements and have a closer look. Is this table tidy?

import pandas as pd

df = pd.read_csv('../data/Multi_analysis.csv', header = [0,1], sep=';')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   (Before, channel_1)  26 non-null     float64
 1   (Before, channel_2)  26 non-null     float64
 2   (After, channel_1)   26 non-null     float64
 3   (After, channel_2)   26 non-null     float64
dtypes: float64(4)
memory usage: 960.0 bytes

df.head()

	Before		After
	channel_1	channel_2	channel_1	channel_2
0	13.250000	21.000000	15.137984	42.022776
1	44.954545	24.318182	43.328836	48.661610
2	13.590909	18.772727	11.685995	37.926184
3	85.032258	19.741935	86.031461	40.396353
4	10.731707	25.268293	10.075421	51.471865

df.columns

MultiIndex([('Before', 'channel_1'),
            ('Before', 'channel_2'),
            ( 'After', 'channel_1'),
            ( 'After', 'channel_2')],
           )

A useful method for tidying a DataFrame is Pandas’ melt. It reformats the DataFrame from wide format to long format, incorporating the columns’ index labels into column values.

df.melt()

	variable_0	variable_1	value
0	Before	channel_1	13.250000
1	Before	channel_1	44.954545
2	Before	channel_1	13.590909
3	Before	channel_1	85.032258
4	Before	channel_1	10.731707
...	...	...	...
99	After	channel_2	73.286439
100	After	channel_2	145.900739
101	After	channel_2	115.347217
102	After	channel_2	61.225962
103	After	channel_2	77.490249

104 rows × 3 columns

We can specify names for the to-be-created columns for the value and the variables. In this case, our measurements are of intensity, and our variables are intervention (before or after) and channel.

# df_tidy = df.melt(value_name='intensity', var_name=['intervention', 'channel'])
# For the above command, there's a bug in the current Pandas version - not allowing lists for var_name
df_tidy = df.melt(value_name='intensity')
df_tidy

	variable_0	variable_1	intensity
0	Before	channel_1	13.250000
1	Before	channel_1	44.954545
2	Before	channel_1	13.590909
3	Before	channel_1	85.032258
4	Before	channel_1	10.731707
...	...	...	...
99	After	channel_2	73.286439
100	After	channel_2	145.900739
101	After	channel_2	115.347217
102	After	channel_2	61.225962
103	After	channel_2	77.490249

104 rows × 3 columns

Finally, renaming the variable columns:

df_tidy.rename(columns={'variable_0':'intervention', 'variable_1':'channel'}, inplace=True)
df_tidy

	intervention	channel	intensity
0	Before	channel_1	13.250000
1	Before	channel_1	44.954545
2	Before	channel_1	13.590909
3	Before	channel_1	85.032258
4	Before	channel_1	10.731707
...	...	...	...
99	After	channel_2	73.286439
100	After	channel_2	145.900739
101	After	channel_2	115.347217
102	After	channel_2	61.225962
103	After	channel_2	77.490249

104 rows × 3 columns

A DataFrame formatted this way is easier to manipulate in analysis, because now we can more easily mask by columns. Here we select intensity measurement rows for “channel_2” and “After” intervention.

df_tidy.loc[(df_tidy['channel'] == 'channel_2') & (df_tidy['intervention'] == 'After'), 'intensity']

    42.022776
    48.661610
    37.926184
    40.396353
    51.471865
    73.347843
    69.902829
    70.156432
   104.525198
    50.563301
    51.381594
    56.543107
    45.215405
    81.326111
   235.067654
    69.820702
    59.870177
    68.493363
    67.379506
   207.956510
    58.361239
    73.286439
  145.900739
  115.347217
   61.225962
   77.490249
Name: intensity, dtype: float64