{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# Tidy-Data\n",
    "\n",
    "Hadley Wickham wrote a [great article](https://www.jstatsoft.org/article/view/v059i10) in favor of “tidy data.” Tidy data follows the rules:\n",
    "* Each variable is a column.\n",
    "* Each observation is a row.\n",
    "* Each type of observation has its own separate table / DataFrame.\n",
    "\n",
    "This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.\n",
    "\n",
    "Let's import a prepared table with measurements and have a closer look. Is this table tidy?"
   ],
   "id": "280458fefd76e886"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-30T14:23:42.200066Z",
     "start_time": "2024-04-30T14:23:42.030166Z"
    }
   },
   "cell_type": "code",
   "source": "import pandas as pd",
   "id": "9020cdef099aaef5",
   "outputs": [],
   "execution_count": 1
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-30T14:23:42.209012Z",
     "start_time": "2024-04-30T14:23:42.201153Z"
    }
   },
   "cell_type": "code",
   "source": [
    "df = pd.read_csv('../data/Multi_analysis.csv', header = [0,1], sep=';')\n",
    "df.info()"
   ],
   "id": "94736f96f6db1511",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 26 entries, 0 to 25\n",
      "Data columns (total 4 columns):\n",
      " #   Column               Non-Null Count  Dtype  \n",
      "---  ------               --------------  -----  \n",
      " 0   (Before, channel_1)  26 non-null     float64\n",
      " 1   (Before, channel_2)  26 non-null     float64\n",
      " 2   (After, channel_1)   26 non-null     float64\n",
      " 3   (After, channel_2)   26 non-null     float64\n",
      "dtypes: float64(4)\n",
      "memory usage: 960.0 bytes\n"
     ]
    }
   ],
   "execution_count": 2
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-30T14:23:42.215624Z",
     "start_time": "2024-04-30T14:23:42.209698Z"
    }
   },
   "cell_type": "code",
   "source": "df.head()",
   "id": "c409a946b45aad45",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "      Before                 After           \n",
       "   channel_1  channel_2  channel_1  channel_2\n",
       "0  13.250000  21.000000  15.137984  42.022776\n",
       "1  44.954545  24.318182  43.328836  48.661610\n",
       "2  13.590909  18.772727  11.685995  37.926184\n",
       "3  85.032258  19.741935  86.031461  40.396353\n",
       "4  10.731707  25.268293  10.075421  51.471865"
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"2\" halign=\"left\">Before</th>\n",
       "      <th colspan=\"2\" halign=\"left\">After</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>channel_1</th>\n",
       "      <th>channel_2</th>\n",
       "      <th>channel_1</th>\n",
       "      <th>channel_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>13.250000</td>\n",
       "      <td>21.000000</td>\n",
       "      <td>15.137984</td>\n",
       "      <td>42.022776</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>44.954545</td>\n",
       "      <td>24.318182</td>\n",
       "      <td>43.328836</td>\n",
       "      <td>48.661610</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>13.590909</td>\n",
       "      <td>18.772727</td>\n",
       "      <td>11.685995</td>\n",
       "      <td>37.926184</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>85.032258</td>\n",
       "      <td>19.741935</td>\n",
       "      <td>86.031461</td>\n",
       "      <td>40.396353</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10.731707</td>\n",
       "      <td>25.268293</td>\n",
       "      <td>10.075421</td>\n",
       "      <td>51.471865</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 3
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-30T14:23:42.218703Z",
     "start_time": "2024-04-30T14:23:42.216368Z"
    }
   },
   "cell_type": "code",
   "source": "df.columns",
   "id": "6eeeab541af60b97",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultiIndex([('Before', 'channel_1'),\n",
       "            ('Before', 'channel_2'),\n",
       "            ( 'After', 'channel_1'),\n",
       "            ( 'After', 'channel_2')],\n",
       "           )"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 4
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "A useful method for tidying a DataFrame is Pandas' [melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html). It reformats the DataFrame from wide format to long format, incorporating the columns' index labels into column values.",
   "id": "4ced42156ad8f11e"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-30T14:23:42.224674Z",
     "start_time": "2024-04-30T14:23:42.219940Z"
    }
   },
   "cell_type": "code",
   "source": "df.melt()",
   "id": "34c88d6d62d52724",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "    variable_0 variable_1       value\n",
       "0       Before  channel_1   13.250000\n",
       "1       Before  channel_1   44.954545\n",
       "2       Before  channel_1   13.590909\n",
       "3       Before  channel_1   85.032258\n",
       "4       Before  channel_1   10.731707\n",
       "..         ...        ...         ...\n",
       "99       After  channel_2   73.286439\n",
       "100      After  channel_2  145.900739\n",
       "101      After  channel_2  115.347217\n",
       "102      After  channel_2   61.225962\n",
       "103      After  channel_2   77.490249\n",
       "\n",
       "[104 rows x 3 columns]"
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>variable_0</th>\n",
       "      <th>variable_1</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>13.250000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>44.954545</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>13.590909</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>85.032258</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>10.731707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>73.286439</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>145.900739</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>101</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>115.347217</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>102</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>61.225962</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>103</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>77.490249</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>104 rows × 3 columns</p>\n",
       "</div>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 5
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "We can specify names for the to-be-created columns for the value and the variables. In this case, our measurements are of intensity, and our variables are intervention (before or after) and channel.",
   "id": "c28e5bf63141a57f"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-30T14:23:42.230168Z",
     "start_time": "2024-04-30T14:23:42.225383Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# df_tidy = df.melt(value_name='intensity', var_name=['intervention', 'channel'])\n",
    "# For the above command, there's a bug in the current Pandas version - not allowing lists for var_name\n",
    "df_tidy = df.melt(value_name='intensity')\n",
    "df_tidy"
   ],
   "id": "874959e9f47a9c49",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "    variable_0 variable_1   intensity\n",
       "0       Before  channel_1   13.250000\n",
       "1       Before  channel_1   44.954545\n",
       "2       Before  channel_1   13.590909\n",
       "3       Before  channel_1   85.032258\n",
       "4       Before  channel_1   10.731707\n",
       "..         ...        ...         ...\n",
       "99       After  channel_2   73.286439\n",
       "100      After  channel_2  145.900739\n",
       "101      After  channel_2  115.347217\n",
       "102      After  channel_2   61.225962\n",
       "103      After  channel_2   77.490249\n",
       "\n",
       "[104 rows x 3 columns]"
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>variable_0</th>\n",
       "      <th>variable_1</th>\n",
       "      <th>intensity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>13.250000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>44.954545</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>13.590909</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>85.032258</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>10.731707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>73.286439</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>145.900739</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>101</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>115.347217</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>102</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>61.225962</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>103</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>77.490249</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>104 rows × 3 columns</p>\n",
       "</div>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 6
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "Finally, renaming the variable columns:",
   "id": "ea1a0c2c1985f437"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-30T14:23:42.235038Z",
     "start_time": "2024-04-30T14:23:42.230779Z"
    }
   },
   "cell_type": "code",
   "source": [
    "df_tidy.rename(columns={'variable_0':'intervention', 'variable_1':'channel'}, inplace=True)\n",
    "df_tidy"
   ],
   "id": "14becd0c0ace466",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "    intervention    channel   intensity\n",
       "0         Before  channel_1   13.250000\n",
       "1         Before  channel_1   44.954545\n",
       "2         Before  channel_1   13.590909\n",
       "3         Before  channel_1   85.032258\n",
       "4         Before  channel_1   10.731707\n",
       "..           ...        ...         ...\n",
       "99         After  channel_2   73.286439\n",
       "100        After  channel_2  145.900739\n",
       "101        After  channel_2  115.347217\n",
       "102        After  channel_2   61.225962\n",
       "103        After  channel_2   77.490249\n",
       "\n",
       "[104 rows x 3 columns]"
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>intervention</th>\n",
       "      <th>channel</th>\n",
       "      <th>intensity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>13.250000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>44.954545</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>13.590909</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>85.032258</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Before</td>\n",
       "      <td>channel_1</td>\n",
       "      <td>10.731707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>73.286439</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>145.900739</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>101</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>115.347217</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>102</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>61.225962</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>103</th>\n",
       "      <td>After</td>\n",
       "      <td>channel_2</td>\n",
       "      <td>77.490249</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>104 rows × 3 columns</p>\n",
       "</div>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 7
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "A DataFrame formatted this way is easier to manipulate in analysis, because now we can more easily mask by columns. Here we select intensity measurement rows for \"channel_2\" and \"After\" intervention.",
   "id": "dbc60a76c8709b17"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-30T14:23:42.238876Z",
     "start_time": "2024-04-30T14:23:42.235784Z"
    }
   },
   "cell_type": "code",
   "source": "df_tidy.loc[(df_tidy['channel'] == 'channel_2') & (df_tidy['intervention'] == 'After'), 'intensity']",
   "id": "d06781f500062d26",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "78      42.022776\n",
       "79      48.661610\n",
       "80      37.926184\n",
       "81      40.396353\n",
       "82      51.471865\n",
       "83      73.347843\n",
       "84      69.902829\n",
       "85      70.156432\n",
       "86     104.525198\n",
       "87      50.563301\n",
       "88      51.381594\n",
       "89      56.543107\n",
       "90      45.215405\n",
       "91      81.326111\n",
       "92     235.067654\n",
       "93      69.820702\n",
       "94      59.870177\n",
       "95      68.493363\n",
       "96      67.379506\n",
       "97     207.956510\n",
       "98      58.361239\n",
       "99      73.286439\n",
       "100    145.900739\n",
       "101    115.347217\n",
       "102     61.225962\n",
       "103     77.490249\n",
       "Name: intensity, dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 8
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}