{ "cells": [ { "metadata": { "pycharm": { "name": "#%% md\n" } }, "cell_type": "markdown", "source": [ "# Introduction to working with DataFrames\n", "\n", "In basic Python, we often use dictionaries containing data as vectors, like our measurements. While these basic structures are convenient\n", " for collecting data, they are suboptimal for advanced data processing. For that, we introduce [pandas](https://pandas.pydata.org/), one of the primary tools in the Python Data Science ecosystem for handling data. Pandas' primary object, the \"DataFrame\", is a 2-dimensional labeled data structure with columns of different data types, and is extremely useful in wrangling data. \n", "\n", "See also:\n", "* [DataFrame - API reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)\n", "* [DataFrame - user guide](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe)" ], "id": "92204799a7a3d7c8" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:12:10.467622Z", "start_time": "2024-04-30T14:12:09.831096Z" }, "pycharm": { "name": "#%%\n" } }, "cell_type": "code", "source": [ "import pandas as pd" ], "id": "9125bc100baf5f39", "outputs": [], "execution_count": 1 }, { "metadata": { "pycharm": { "name": "#%% md\n" } }, "cell_type": "markdown", "source": [ "## Creating DataFrames\n", "### DataFrames from dictionaries\n", "Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:" ], "id": "69ec023821c17ef8" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:12:10.470810Z", "start_time": "2024-04-30T14:12:10.468673Z" }, "pycharm": { "name": "#%%\n" } }, "cell_type": "code", "source": [ "measurements = {\n", " \"labels\": [1, 2, 3],\n", " \"area\": [45, 23, 68],\n", " \"minor_axis\": [2, 4, 4],\n", " \"major_axis\": [3, 4, 5],\n", "}" ], "id": "8af6991d7441650f", "outputs": [], "execution_count": 2 }, { "metadata": { "pycharm": { "name": "#%% md\n" } }, "cell_type": "markdown", "source": [ "This data structure can be easily transferred into a DataFrame which provides additional functionalities, like visualizing it nicely:" ], "id": "62ef9d3f31248eed" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:12:10.480187Z", "start_time": "2024-04-30T14:12:10.471517Z" }, "pycharm": { "name": "#%%\n" } }, "cell_type": "code", "source": [ "df = pd.DataFrame(data=measurements)\n", "df" ], "id": "eaf8d6b232e10cb1", "outputs": [ { "data": { "text/plain": [ " labels area minor_axis major_axis\n", "0 1 45 2 3\n", "1 2 23 4 4\n", "2 3 68 4 5" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelsareaminor_axismajor_axis
014523
122344
236845
\n", "
" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 3 }, { "metadata": { "pycharm": { "name": "#%% md\n" } }, "cell_type": "markdown", "source": [ "DataFrames also provide some important methods to get an overview of the DataFrames' structure and the containing data:" ], "id": "70d15d36c3c369f3" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:12:10.488446Z", "start_time": "2024-04-30T14:12:10.481456Z" }, "pycharm": { "name": "#%%\n" } }, "cell_type": "code", "source": [ "# Show structure of the DataFrame\n", "df.info()" ], "id": "c37332d683c24ff6", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3 entries, 0 to 2\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 labels 3 non-null int64\n", " 1 area 3 non-null int64\n", " 2 minor_axis 3 non-null int64\n", " 3 major_axis 3 non-null int64\n", "dtypes: int64(4)\n", "memory usage: 224.0 bytes\n" ] } ], "execution_count": 4 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:12:10.497340Z", "start_time": "2024-04-30T14:12:10.489134Z" }, "pycharm": { "name": "#%%\n" } }, "cell_type": "code", "source": [ "# Show basic descriptive statistics of the containing data\n", "df.describe().transpose()" ], "id": "145d80feaa3417e6", "outputs": [ { "data": { "text/plain": [ " count mean std min 25% 50% 75% max\n", "labels 3.0 2.000000 1.000000 1.0 1.5 2.0 2.5 3.0\n", "area 3.0 45.333333 22.501852 23.0 34.0 45.0 56.5 68.0\n", "minor_axis 3.0 3.333333 1.154701 2.0 3.0 4.0 4.0 4.0\n", "major_axis 3.0 4.000000 1.000000 3.0 3.5 4.0 4.5 5.0" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
labels3.02.0000001.0000001.01.52.02.53.0
area3.045.33333322.50185223.034.045.056.568.0
minor_axis3.03.3333331.1547012.03.04.04.04.0
major_axis3.04.0000001.0000003.03.54.04.55.0
\n", "
" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 5 }, { "metadata": { "pycharm": { "name": "#%% md\n" } }, "cell_type": "markdown", "source": [ "### DataFrames from (nested) lists\n", "Sometimes, we are confronted with data in the form of nested lists (lists of lists). To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists. This can be the rows' names (index), or the columns' names, or both." ], "id": "98295d06567c5e8c" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:12:10.501828Z", "start_time": "2024-04-30T14:12:10.498022Z" }, "pycharm": { "name": "#%%\n" } }, "cell_type": "code", "source": [ "headers = ['labels', 'area', 'minor_axis', 'major_axis']\n", "\n", "data = [\n", " [1, 2, 3],\n", " [45, 23, 68],\n", " [2, 4, 4],\n", " [3, 4, 5],\n", "]\n", "\n", "# convert the data and header arrays in a pandas data frame and show it\n", "df = pd.DataFrame(data, headers)\n", "df" ], "id": "6fcfe3eb61d88ab4", "outputs": [ { "data": { "text/plain": [ " 0 1 2\n", "labels 1 2 3\n", "area 45 23 68\n", "minor_axis 2 4 4\n", "major_axis 3 4 5" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
labels123
area452368
minor_axis244
major_axis345
\n", "
" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 6 }, { "metadata": { "pycharm": { "name": "#%% md\n" } }, "cell_type": "markdown", "source": [ "As you can see, this table is rotated and our column names are set as index (rows' names). We can bring it in the usual form via transposing:" ], "id": "425964ba72a49d07" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:12:10.505596Z", "start_time": "2024-04-30T14:12:10.502569Z" }, "pycharm": { "name": "#%%\n" } }, "cell_type": "code", "source": [ "df = df.transpose()\n", "df" ], "id": "f00881e20091cfd", "outputs": [ { "data": { "text/plain": [ " labels area minor_axis major_axis\n", "0 1 45 2 3\n", "1 2 23 4 4\n", "2 3 68 4 5" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelsareaminor_axismajor_axis
014523
122344
236845
\n", "
" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 7 }, { "metadata": { "pycharm": { "name": "#%% md\n" } }, "cell_type": "markdown", "source": [ "## Saving and loading DataFrames\n", "We can save this DataFrames for continuing to work with it. We chose to save it as a CSV file, where CSV stands for comma-separated value. This is a text file that can easily be read into data structures in many programming languages.\n", "\n", "You should generally always store your data in such an open, compatible format with well-defined specifications, (not necessarily CSV), and is readable in many contexts. Excel files do not meet these criteria, neither do .mat files." ], "id": "3c48b6f9c3f8ddc" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:12:10.509091Z", "start_time": "2024-04-30T14:12:10.506322Z" }, "pycharm": { "name": "#%%\n" } }, "cell_type": "code", "source": [ "df.to_csv('../data/measurements_table.csv')" ], "id": "3671c5cc6e5e78ca", "outputs": [], "execution_count": 8 }, { "metadata": { "pycharm": { "name": "#%% md\n" } }, "cell_type": "markdown", "source": [ "Stored DataFrames can be read with pandas from various formats, so we are able to read files into a DataFrame again.\n", "\n", "See also:\n", "* [Input and Output with pandas](https://pandas.pydata.org/pandas-docs/stable/reference/io.html)" ], "id": "bd2bc8f3500837e5" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-30T14:12:10.517187Z", "start_time": "2024-04-30T14:12:10.509781Z" }, "pycharm": { "name": "#%%\n" } }, "cell_type": "code", "source": [ "# Since we know that the column at position 0 is our index, we shall provide this information\n", "df_new = pd.read_csv('../data/blobs_statistics.csv', index_col=0)\n", "df_new" ], "id": "a5a55abb1fb3c53a", "outputs": [ { "data": { "text/plain": [ " area mean_intensity minor_axis_length major_axis_length eccentricity \\\n", "0 422 192.379147 16.488550 34.566789 0.878900 \n", "1 182 180.131868 11.736074 20.802697 0.825665 \n", "2 661 205.216339 28.409502 30.208433 0.339934 \n", "3 437 216.585812 23.143996 24.606130 0.339576 \n", "4 476 212.302521 19.852882 31.075106 0.769317 \n", ".. ... ... ... ... ... \n", "56 211 185.061611 14.522762 18.489138 0.618893 \n", "57 78 185.230769 6.028638 17.579799 0.939361 \n", "58 86 183.720930 5.426871 21.261427 0.966876 \n", "59 51 190.431373 5.032414 13.742079 0.930534 \n", "60 46 175.304348 3.803982 15.948714 0.971139 \n", "\n", " extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 \\\n", "0 0.586111 35.227830 23.179885 0 11 \n", "1 0.787879 21.377558 15.222667 0 53 \n", "2 0.874339 32.756679 29.010538 0 95 \n", "3 0.826087 26.925824 23.588253 0 144 \n", "4 0.863884 31.384710 24.618327 0 237 \n", ".. ... ... ... ... ... \n", "56 0.781481 18.973666 16.390654 232 39 \n", "57 0.722222 18.027756 9.965575 248 170 \n", "58 0.781818 22.000000 10.464158 249 117 \n", "59 0.728571 14.035669 8.058239 249 228 \n", "60 0.766667 15.033296 7.653040 250 67 \n", "\n", " bbox-2 bbox-3 \n", "0 30 35 \n", "1 11 74 \n", "2 28 122 \n", "3 23 167 \n", "4 29 256 \n", ".. ... ... \n", "56 250 54 \n", "57 254 188 \n", "58 254 139 \n", "59 254 242 \n", "60 254 82 \n", "\n", "[61 rows x 12 columns]" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areamean_intensityminor_axis_lengthmajor_axis_lengtheccentricityextentferet_diameter_maxequivalent_diameter_areabbox-0bbox-1bbox-2bbox-3
0422192.37914716.48855034.5667890.8789000.58611135.22783023.1798850113035
1182180.13186811.73607420.8026970.8256650.78787921.37755815.2226670531174
2661205.21633928.40950230.2084330.3399340.87433932.75667929.01053809528122
3437216.58581223.14399624.6061300.3395760.82608726.92582423.588253014423167
4476212.30252119.85288231.0751060.7693170.86388431.38471024.618327023729256
.......................................
56211185.06161114.52276218.4891380.6188930.78148118.97366616.3906542323925054
5778185.2307696.02863817.5797990.9393610.72222218.0277569.965575248170254188
5886183.7209305.42687121.2614270.9668760.78181822.00000010.464158249117254139
5951190.4313735.03241413.7420790.9305340.72857114.0356698.058239249228254242
6046175.3043483.80398215.9487140.9711390.76666715.0332967.6530402506725482
\n", "

61 rows × 12 columns

\n", "
" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 9 } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }