Tidy-Data#
Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data follows the rules:
Each variable is a column.
Each observation is a row.
Each type of observation has its own separate table / DataFrame.
This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.
Let’s import a prepared table with measurements and have a closer look. Is this table tidy?
import pandas as pd
df = pd.read_csv('../data/Multi_analysis.csv', header = [0,1], sep=';')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (Before, channel_1) 26 non-null float64
1 (Before, channel_2) 26 non-null float64
2 (After, channel_1) 26 non-null float64
3 (After, channel_2) 26 non-null float64
dtypes: float64(4)
memory usage: 960.0 bytes
df.head()
Before | After | |||
---|---|---|---|---|
channel_1 | channel_2 | channel_1 | channel_2 | |
0 | 13.250000 | 21.000000 | 15.137984 | 42.022776 |
1 | 44.954545 | 24.318182 | 43.328836 | 48.661610 |
2 | 13.590909 | 18.772727 | 11.685995 | 37.926184 |
3 | 85.032258 | 19.741935 | 86.031461 | 40.396353 |
4 | 10.731707 | 25.268293 | 10.075421 | 51.471865 |
df.columns
MultiIndex([('Before', 'channel_1'),
('Before', 'channel_2'),
( 'After', 'channel_1'),
( 'After', 'channel_2')],
)
A useful method for tidying a DataFrame is Pandas’ melt. It reformats the DataFrame from wide format to long format, incorporating the columns’ index labels into column values.
df.melt()
variable_0 | variable_1 | value | |
---|---|---|---|
0 | Before | channel_1 | 13.250000 |
1 | Before | channel_1 | 44.954545 |
2 | Before | channel_1 | 13.590909 |
3 | Before | channel_1 | 85.032258 |
4 | Before | channel_1 | 10.731707 |
... | ... | ... | ... |
99 | After | channel_2 | 73.286439 |
100 | After | channel_2 | 145.900739 |
101 | After | channel_2 | 115.347217 |
102 | After | channel_2 | 61.225962 |
103 | After | channel_2 | 77.490249 |
104 rows × 3 columns
We can specify names for the to-be-created columns for the value and the variables. In this case, our measurements are of intensity, and our variables are intervention (before or after) and channel.
# df_tidy = df.melt(value_name='intensity', var_name=['intervention', 'channel'])
# For the above command, there's a bug in the current Pandas version - not allowing lists for var_name
df_tidy = df.melt(value_name='intensity')
df_tidy
variable_0 | variable_1 | intensity | |
---|---|---|---|
0 | Before | channel_1 | 13.250000 |
1 | Before | channel_1 | 44.954545 |
2 | Before | channel_1 | 13.590909 |
3 | Before | channel_1 | 85.032258 |
4 | Before | channel_1 | 10.731707 |
... | ... | ... | ... |
99 | After | channel_2 | 73.286439 |
100 | After | channel_2 | 145.900739 |
101 | After | channel_2 | 115.347217 |
102 | After | channel_2 | 61.225962 |
103 | After | channel_2 | 77.490249 |
104 rows × 3 columns
Finally, renaming the variable columns:
df_tidy.rename(columns={'variable_0':'intervention', 'variable_1':'channel'}, inplace=True)
df_tidy
intervention | channel | intensity | |
---|---|---|---|
0 | Before | channel_1 | 13.250000 |
1 | Before | channel_1 | 44.954545 |
2 | Before | channel_1 | 13.590909 |
3 | Before | channel_1 | 85.032258 |
4 | Before | channel_1 | 10.731707 |
... | ... | ... | ... |
99 | After | channel_2 | 73.286439 |
100 | After | channel_2 | 145.900739 |
101 | After | channel_2 | 115.347217 |
102 | After | channel_2 | 61.225962 |
103 | After | channel_2 | 77.490249 |
104 rows × 3 columns
A DataFrame formatted this way is easier to manipulate in analysis, because now we can more easily mask by columns. Here we select intensity measurement rows for “channel_2” and “After” intervention.
df_tidy.loc[(df_tidy['channel'] == 'channel_2') & (df_tidy['intervention'] == 'After'), 'intensity']
78 42.022776
79 48.661610
80 37.926184
81 40.396353
82 51.471865
83 73.347843
84 69.902829
85 70.156432
86 104.525198
87 50.563301
88 51.381594
89 56.543107
90 45.215405
91 81.326111
92 235.067654
93 69.820702
94 59.870177
95 68.493363
96 67.379506
97 207.956510
98 58.361239
99 73.286439
100 145.900739
101 115.347217
102 61.225962
103 77.490249
Name: intensity, dtype: float64