Tidy-Data#

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data follows the rules:

  • Each variable is a column.

  • Each observation is a row.

  • Each type of observation has its own separate table / DataFrame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.

Let’s import a prepared table with measurements and have a closer look. Is this table tidy?

import pandas as pd
df = pd.read_csv('../data/Multi_analysis.csv', header = [0,1], sep=';')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   (Before, channel_1)  26 non-null     float64
 1   (Before, channel_2)  26 non-null     float64
 2   (After, channel_1)   26 non-null     float64
 3   (After, channel_2)   26 non-null     float64
dtypes: float64(4)
memory usage: 960.0 bytes
df.head()
Before After
channel_1 channel_2 channel_1 channel_2
0 13.250000 21.000000 15.137984 42.022776
1 44.954545 24.318182 43.328836 48.661610
2 13.590909 18.772727 11.685995 37.926184
3 85.032258 19.741935 86.031461 40.396353
4 10.731707 25.268293 10.075421 51.471865
df.columns
MultiIndex([('Before', 'channel_1'),
            ('Before', 'channel_2'),
            ( 'After', 'channel_1'),
            ( 'After', 'channel_2')],
           )

A useful method for tidying a DataFrame is Pandas’ melt. It reformats the DataFrame from wide format to long format, incorporating the columns’ index labels into column values.

df.melt()
variable_0 variable_1 value
0 Before channel_1 13.250000
1 Before channel_1 44.954545
2 Before channel_1 13.590909
3 Before channel_1 85.032258
4 Before channel_1 10.731707
... ... ... ...
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249

104 rows × 3 columns

We can specify names for the to-be-created columns for the value and the variables. In this case, our measurements are of intensity, and our variables are intervention (before or after) and channel.

# df_tidy = df.melt(value_name='intensity', var_name=['intervention', 'channel'])
# For the above command, there's a bug in the current Pandas version - not allowing lists for var_name
df_tidy = df.melt(value_name='intensity')
df_tidy
variable_0 variable_1 intensity
0 Before channel_1 13.250000
1 Before channel_1 44.954545
2 Before channel_1 13.590909
3 Before channel_1 85.032258
4 Before channel_1 10.731707
... ... ... ...
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249

104 rows × 3 columns

Finally, renaming the variable columns:

df_tidy.rename(columns={'variable_0':'intervention', 'variable_1':'channel'}, inplace=True)
df_tidy
intervention channel intensity
0 Before channel_1 13.250000
1 Before channel_1 44.954545
2 Before channel_1 13.590909
3 Before channel_1 85.032258
4 Before channel_1 10.731707
... ... ... ...
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249

104 rows × 3 columns

A DataFrame formatted this way is easier to manipulate in analysis, because now we can more easily mask by columns. Here we select intensity measurement rows for “channel_2” and “After” intervention.

df_tidy.loc[(df_tidy['channel'] == 'channel_2') & (df_tidy['intervention'] == 'After'), 'intensity']
78      42.022776
79      48.661610
80      37.926184
81      40.396353
82      51.471865
83      73.347843
84      69.902829
85      70.156432
86     104.525198
87      50.563301
88      51.381594
89      56.543107
90      45.215405
91      81.326111
92     235.067654
93      69.820702
94      59.870177
95      68.493363
96      67.379506
97     207.956510
98      58.361239
99      73.286439
100    145.900739
101    115.347217
102     61.225962
103     77.490249
Name: intensity, dtype: float64