Introduction to working with DataFrames

Introduction to working with DataFrames#

In basic Python, we often use dictionaries containing data as vectors, like our measurements. While these basic structures are convenient for collecting data, they are suboptimal for advanced data processing. For that, we introduce pandas, one of the primary tools in the Python Data Science ecosystem for handling data. Pandas’ primary object, the “DataFrame”, is a 2-dimensional labeled data structure with columns of different data types, and is extremely useful in wrangling data.

Creating DataFrames#

DataFrames from dictionaries#

Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:

measurements = {
    "labels": [1, 2, 3],
    "area": [45, 23, 68],
    "minor_axis": [2, 4, 4],
    "major_axis": [3, 4, 5],
}

This data structure can be easily transferred into a DataFrame which provides additional functionalities, like visualizing it nicely:

df = pd.DataFrame(data=measurements)
df

	labels	area	minor_axis	major_axis
0	1	45	2	3
1	2	23	4	4
2	3	68	4	5

DataFrames also provide some important methods to get an overview of the DataFrames’ structure and the containing data:

# Show structure of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   labels      3 non-null      int64
 1   area        3 non-null      int64
 2   minor_axis  3 non-null      int64
 3   major_axis  3 non-null      int64
dtypes: int64(4)
memory usage: 224.0 bytes

# Show basic descriptive statistics of the containing data
df.describe().transpose()

	count	mean	std	min	25%	50%	75%	max
labels	3.0	2.000000	1.000000	1.0	1.5	2.0	2.5	3.0
area	3.0	45.333333	22.501852	23.0	34.0	45.0	56.5	68.0
minor_axis	3.0	3.333333	1.154701	2.0	3.0	4.0	4.0	4.0
major_axis	3.0	4.000000	1.000000	3.0	3.5	4.0	4.5	5.0

DataFrames from (nested) lists#

Sometimes, we are confronted with data in the form of nested lists (lists of lists). To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists. This can be the rows’ names (index), or the columns’ names, or both.

headers = ['labels', 'area', 'minor_axis', 'major_axis']

data = [
    [1, 2, 3],
    [45, 23, 68],
    [2, 4, 4],
    [3, 4, 5],
]

# convert the data and header arrays in a pandas data frame and show it
df = pd.DataFrame(data, headers)
df

	0	1	2
labels	1	2	3
area	45	23	68
minor_axis	2	4	4
major_axis	3	4	5

As you can see, this table is rotated and our column names are set as index (rows’ names). We can bring it in the usual form via transposing:

df = df.transpose()
df

	labels	area	minor_axis	major_axis
0	1	45	2	3
1	2	23	4	4
2	3	68	4	5

Saving and loading DataFrames#

We can save this DataFrames for continuing to work with it. We chose to save it as a CSV file, where CSV stands for comma-separated value. This is a text file that can easily be read into data structures in many programming languages.

You should generally always store your data in such an open, compatible format with well-defined specifications, (not necessarily CSV), and is readable in many contexts. Excel files do not meet these criteria, neither do .mat files.

df.to_csv('../data/measurements_table.csv')

Stored DataFrames can be read with pandas from various formats, so we are able to read files into a DataFrame again.

See also:

Input and Output with pandas

# Since we know that the column at position 0 is our index, we shall provide this information
df_new = pd.read_csv('../data/blobs_statistics.csv', index_col=0)
df_new

	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
0	422	192.379147	16.488550	34.566789	0.878900	0.586111	35.227830	23.179885	0	11	30	35
1	182	180.131868	11.736074	20.802697	0.825665	0.787879	21.377558	15.222667	0	53	11	74
2	661	205.216339	28.409502	30.208433	0.339934	0.874339	32.756679	29.010538	0	95	28	122
3	437	216.585812	23.143996	24.606130	0.339576	0.826087	26.925824	23.588253	0	144	23	167
4	476	212.302521	19.852882	31.075106	0.769317	0.863884	31.384710	24.618327	0	237	29	256
...	...	...	...	...	...	...	...	...	...	...	...	...
56	211	185.061611	14.522762	18.489138	0.618893	0.781481	18.973666	16.390654	232	39	250	54
57	78	185.230769	6.028638	17.579799	0.939361	0.722222	18.027756	9.965575	248	170	254	188
58	86	183.720930	5.426871	21.261427	0.966876	0.781818	22.000000	10.464158	249	117	254	139
59	51	190.431373	5.032414	13.742079	0.930534	0.728571	14.035669	8.058239	249	228	254	242
60	46	175.304348	3.803982	15.948714	0.971139	0.766667	15.033296	7.653040	250	67	254	82

61 rows × 12 columns