{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": [ "# Tidy-Data\n", "\n", "Hadley Wickham wrote a [great article](https://www.jstatsoft.org/article/view/v059i10) in favor of “tidy data.” Tidy data follows the rules:\n", "* Each variable is a column.\n", "* Each observation is a row.\n", "* Each type of observation has its own separate table / DataFrame.\n", "\n", "This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.\n", "\n", "Let's import a prepared table with measurements and have a closer look. Is this table tidy?" ], "id": "280458fefd76e886" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:23:42.200066Z", "start_time": "2024-04-30T14:23:42.030166Z" } }, "cell_type": "code", "source": "import pandas as pd", "id": "9020cdef099aaef5", "outputs": [], "execution_count": 1 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:23:42.209012Z", "start_time": "2024-04-30T14:23:42.201153Z" } }, "cell_type": "code", "source": [ "df = pd.read_csv('../data/Multi_analysis.csv', header = [0,1], sep=';')\n", "df.info()" ], "id": "94736f96f6db1511", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 26 entries, 0 to 25\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 (Before, channel_1) 26 non-null float64\n", " 1 (Before, channel_2) 26 non-null float64\n", " 2 (After, channel_1) 26 non-null float64\n", " 3 (After, channel_2) 26 non-null float64\n", "dtypes: float64(4)\n", "memory usage: 960.0 bytes\n" ] } ], "execution_count": 2 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:23:42.215624Z", "start_time": "2024-04-30T14:23:42.209698Z" } }, "cell_type": "code", "source": "df.head()", "id": "c409a946b45aad45", "outputs": [ { "data": { "text/plain": [ " Before After \n", " channel_1 channel_2 channel_1 channel_2\n", "0 13.250000 21.000000 15.137984 42.022776\n", "1 44.954545 24.318182 43.328836 48.661610\n", "2 13.590909 18.772727 11.685995 37.926184\n", "3 85.032258 19.741935 86.031461 40.396353\n", "4 10.731707 25.268293 10.075421 51.471865" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BeforeAfter
channel_1channel_2channel_1channel_2
013.25000021.00000015.13798442.022776
144.95454524.31818243.32883648.661610
213.59090918.77272711.68599537.926184
385.03225819.74193586.03146140.396353
410.73170725.26829310.07542151.471865
\n", "
" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 3 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:23:42.218703Z", "start_time": "2024-04-30T14:23:42.216368Z" } }, "cell_type": "code", "source": "df.columns", "id": "6eeeab541af60b97", "outputs": [ { "data": { "text/plain": [ "MultiIndex([('Before', 'channel_1'),\n", " ('Before', 'channel_2'),\n", " ( 'After', 'channel_1'),\n", " ( 'After', 'channel_2')],\n", " )" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 4 }, { "metadata": {}, "cell_type": "markdown", "source": "A useful method for tidying a DataFrame is Pandas' [melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html). It reformats the DataFrame from wide format to long format, incorporating the columns' index labels into column values.", "id": "4ced42156ad8f11e" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:23:42.224674Z", "start_time": "2024-04-30T14:23:42.219940Z" } }, "cell_type": "code", "source": "df.melt()", "id": "34c88d6d62d52724", "outputs": [ { "data": { "text/plain": [ " variable_0 variable_1 value\n", "0 Before channel_1 13.250000\n", "1 Before channel_1 44.954545\n", "2 Before channel_1 13.590909\n", "3 Before channel_1 85.032258\n", "4 Before channel_1 10.731707\n", ".. ... ... ...\n", "99 After channel_2 73.286439\n", "100 After channel_2 145.900739\n", "101 After channel_2 115.347217\n", "102 After channel_2 61.225962\n", "103 After channel_2 77.490249\n", "\n", "[104 rows x 3 columns]" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
variable_0variable_1value
0Beforechannel_113.250000
1Beforechannel_144.954545
2Beforechannel_113.590909
3Beforechannel_185.032258
4Beforechannel_110.731707
............
99Afterchannel_273.286439
100Afterchannel_2145.900739
101Afterchannel_2115.347217
102Afterchannel_261.225962
103Afterchannel_277.490249
\n", "

104 rows × 3 columns

\n", "
" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 5 }, { "metadata": {}, "cell_type": "markdown", "source": "We can specify names for the to-be-created columns for the value and the variables. In this case, our measurements are of intensity, and our variables are intervention (before or after) and channel.", "id": "c28e5bf63141a57f" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:23:42.230168Z", "start_time": "2024-04-30T14:23:42.225383Z" } }, "cell_type": "code", "source": [ "# df_tidy = df.melt(value_name='intensity', var_name=['intervention', 'channel'])\n", "# For the above command, there's a bug in the current Pandas version - not allowing lists for var_name\n", "df_tidy = df.melt(value_name='intensity')\n", "df_tidy" ], "id": "874959e9f47a9c49", "outputs": [ { "data": { "text/plain": [ " variable_0 variable_1 intensity\n", "0 Before channel_1 13.250000\n", "1 Before channel_1 44.954545\n", "2 Before channel_1 13.590909\n", "3 Before channel_1 85.032258\n", "4 Before channel_1 10.731707\n", ".. ... ... ...\n", "99 After channel_2 73.286439\n", "100 After channel_2 145.900739\n", "101 After channel_2 115.347217\n", "102 After channel_2 61.225962\n", "103 After channel_2 77.490249\n", "\n", "[104 rows x 3 columns]" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
variable_0variable_1intensity
0Beforechannel_113.250000
1Beforechannel_144.954545
2Beforechannel_113.590909
3Beforechannel_185.032258
4Beforechannel_110.731707
............
99Afterchannel_273.286439
100Afterchannel_2145.900739
101Afterchannel_2115.347217
102Afterchannel_261.225962
103Afterchannel_277.490249
\n", "

104 rows × 3 columns

\n", "
" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 6 }, { "metadata": {}, "cell_type": "markdown", "source": "Finally, renaming the variable columns:", "id": "ea1a0c2c1985f437" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:23:42.235038Z", "start_time": "2024-04-30T14:23:42.230779Z" } }, "cell_type": "code", "source": [ "df_tidy.rename(columns={'variable_0':'intervention', 'variable_1':'channel'}, inplace=True)\n", "df_tidy" ], "id": "14becd0c0ace466", "outputs": [ { "data": { "text/plain": [ " intervention channel intensity\n", "0 Before channel_1 13.250000\n", "1 Before channel_1 44.954545\n", "2 Before channel_1 13.590909\n", "3 Before channel_1 85.032258\n", "4 Before channel_1 10.731707\n", ".. ... ... ...\n", "99 After channel_2 73.286439\n", "100 After channel_2 145.900739\n", "101 After channel_2 115.347217\n", "102 After channel_2 61.225962\n", "103 After channel_2 77.490249\n", "\n", "[104 rows x 3 columns]" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
interventionchannelintensity
0Beforechannel_113.250000
1Beforechannel_144.954545
2Beforechannel_113.590909
3Beforechannel_185.032258
4Beforechannel_110.731707
............
99Afterchannel_273.286439
100Afterchannel_2145.900739
101Afterchannel_2115.347217
102Afterchannel_261.225962
103Afterchannel_277.490249
\n", "

104 rows × 3 columns

\n", "
" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 7 }, { "metadata": {}, "cell_type": "markdown", "source": "A DataFrame formatted this way is easier to manipulate in analysis, because now we can more easily mask by columns. Here we select intensity measurement rows for \"channel_2\" and \"After\" intervention.", "id": "dbc60a76c8709b17" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:23:42.238876Z", "start_time": "2024-04-30T14:23:42.235784Z" } }, "cell_type": "code", "source": "df_tidy.loc[(df_tidy['channel'] == 'channel_2') & (df_tidy['intervention'] == 'After'), 'intensity']", "id": "d06781f500062d26", "outputs": [ { "data": { "text/plain": [ "78 42.022776\n", "79 48.661610\n", "80 37.926184\n", "81 40.396353\n", "82 51.471865\n", "83 73.347843\n", "84 69.902829\n", "85 70.156432\n", "86 104.525198\n", "87 50.563301\n", "88 51.381594\n", "89 56.543107\n", "90 45.215405\n", "91 81.326111\n", "92 235.067654\n", "93 69.820702\n", "94 59.870177\n", "95 68.493363\n", "96 67.379506\n", "97 207.956510\n", "98 58.361239\n", "99 73.286439\n", "100 145.900739\n", "101 115.347217\n", "102 61.225962\n", "103 77.490249\n", "Name: intensity, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 8 } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }