Selecting data from DataFrames#

Like with Python sequences, we may want to get specific parts of the data in a DataFrame. And like with Python sequences, we can use indexing, slicing, and masking on DataFrames, too.

See also:

We start by reading our statistics CSV file into a DataFrame again

import pandas as pd
df = pd.read_csv('../data/blobs_statistics.csv', index_col=0)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 61 entries, 0 to 60
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   area                      61 non-null     int64  
 1   mean_intensity            61 non-null     float64
 2   minor_axis_length         61 non-null     float64
 3   major_axis_length         61 non-null     float64
 4   eccentricity              61 non-null     float64
 5   extent                    61 non-null     float64
 6   feret_diameter_max        61 non-null     float64
 7   equivalent_diameter_area  61 non-null     float64
 8   bbox-0                    61 non-null     int64  
 9   bbox-1                    61 non-null     int64  
 10  bbox-2                    61 non-null     int64  
 11  bbox-3                    61 non-null     int64  
dtypes: float64(7), int64(5)
memory usage: 6.2 KB

Pandas provides methods to display just the first n rows or last n rows of a DataFrame - head() and tail()

df.head(10)
area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
3 437 216.585812 23.143996 24.606130 0.339576 0.826087 26.925824 23.588253 0 144 23 167
4 476 212.302521 19.852882 31.075106 0.769317 0.863884 31.384710 24.618327 0 237 29 256
5 277 206.469314 17.523565 20.151370 0.493763 0.769444 20.880613 18.779972 6 189 26 207
6 259 178.007722 15.309080 21.761302 0.710695 0.735795 22.803509 18.159544 17 211 39 227
7 219 191.598174 15.745458 17.729017 0.459616 0.760417 18.681542 16.698487 18 37 36 53
8 67 167.522388 8.731268 9.761752 0.447195 0.827160 10.295630 9.236182 18 133 27 142
9 19 155.368421 1.967201 11.358057 0.984887 0.863636 11.000000 4.918491 21 0 32 2

We can also get the column names via columns. This will provide a so-called Index object with the column names and some metadata

df.columns
Index(['area', 'mean_intensity', 'minor_axis_length', 'major_axis_length',
       'eccentricity', 'extent', 'feret_diameter_max',
       'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
      dtype='object')

Selecting rows and columns#

Let’s get the DataFrames first row

df[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/Documents/Miniconda/envs/devbio-napari-env/lib/python3.9/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[5], line 1
----> 1 df[0]

File ~/Documents/Miniconda/envs/devbio-napari-env/lib/python3.9/site-packages/pandas/core/frame.py:4090, in DataFrame.__getitem__(self, key)
   4088 if self.columns.nlevels > 1:
   4089     return self._getitem_multilevel(key)
-> 4090 indexer = self.columns.get_loc(key)
   4091 if is_integer(indexer):
   4092     indexer = [indexer]

File ~/Documents/Miniconda/envs/devbio-napari-env/lib/python3.9/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 0

Ooops…this raised a KeyError. For DataFrames, the [] operator expects labels only, and works on the columns, not the rows. We can use it to get a subset of the DataFrame with the specified column labels:

df['area']
0     422
1     182
2     661
3     437
4     476
     ... 
56    211
57     78
58     86
59     51
60     46
Name: area, Length: 61, dtype: int64

Notice that the resulting data structure has the column data and an index, too. That’s because Pandas DataFrames use Pandas Series for the columns, which are one-dimensional arrays with an index for the rows and a label for the column data. It can be seen as a single-column table.

We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable.

df_subset = df[['area', 'mean_intensity']]
df_subset
area mean_intensity
0 422 192.379147
1 182 180.131868
2 661 205.216339
3 437 216.585812
4 476 212.302521
... ... ...
56 211 185.061611
57 78 185.230769
58 86 183.720930
59 51 190.431373
60 46 175.304348

61 rows × 2 columns

Label-location based selection#

If we want to select specific rows, we need to use the method loc[]. Note, that loc[] also works with labels only. Let’s get the row with label 0 (its index label).

df.loc[0]
area                        422.000000
mean_intensity              192.379147
minor_axis_length            16.488550
major_axis_length            34.566789
eccentricity                  0.878900
extent                        0.586111
feret_diameter_max           35.227830
equivalent_diameter_area     23.179885
bbox-0                        0.000000
bbox-1                       11.000000
bbox-2                       30.000000
bbox-3                       35.000000
Name: 0, dtype: float64

We can also select more than one row by providing the row labels as a list. In addition, we can select specific columns, too:

df.loc[[0,2], ['area', 'mean_intensity']]
area mean_intensity
0 422 192.379147
2 661 205.216339

We can also use slicing via [from:to] to get data starting from a specific label up to a specific label.

df.loc[0:2, 'area':'extent']
area mean_intensity minor_axis_length major_axis_length eccentricity extent
0 422 192.379147 16.488550 34.566789 0.878900 0.586111
1 182 180.131868 11.736074 20.802697 0.825665 0.787879
2 661 205.216339 28.409502 30.208433 0.339934 0.874339

Attention: Slicing on label-location requires that the order of the labels in the row and column index is known and taken into account:

df.loc[0:2, 'extent':'area']
0
1
2

Integer-location based selection#

If we want to select data based on the indices’ integer positions, we need to use iloc[]. Let’s get the value from the first row and the first column:

df.iloc[0, 0]
422

We can use slicing to get specific parts of the data. Let’s get the first 10 rows and the first 5 columns:

df.iloc[0:10, 0:5]
area mean_intensity minor_axis_length major_axis_length eccentricity
0 422 192.379147 16.488550 34.566789 0.878900
1 182 180.131868 11.736074 20.802697 0.825665
2 661 205.216339 28.409502 30.208433 0.339934
3 437 216.585812 23.143996 24.606130 0.339576
4 476 212.302521 19.852882 31.075106 0.769317
5 277 206.469314 17.523565 20.151370 0.493763
6 259 178.007722 15.309080 21.761302 0.710695
7 219 191.598174 15.745458 17.729017 0.459616
8 67 167.522388 8.731268 9.761752 0.447195
9 19 155.368421 1.967201 11.358057 0.984887

Let’s get everything starting from the 10th row and starting from the 5th column:

df.iloc[10:, 5:]
extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
10 0.747692 26.925824 24.875579 21 162 47 187
11 0.725806 33.837849 28.322092 26 59 57 87
12 0.767361 18.973666 16.774562 39 225 57 241
13 0.787879 11.704700 9.965575 40 4 51 13
14 0.784965 26.925824 23.909926 42 128 68 150
15 0.767857 28.861739 25.631847 44 17 72 41
16 0.599078 34.014703 22.283703 44 184 75 205
17 0.775926 28.160256 23.097346 60 89 87 109
18 0.747899 22.472205 18.437868 60 206 81 223
19 0.800454 24.186773 21.200320 63 235 84 256
20 0.774359 16.155494 13.865755 66 162 81 175
21 0.790514 24.698178 22.567583 73 53 96 75
22 0.676190 31.953091 23.289484 75 117 96 147
23 0.809211 19.924859 17.697936 77 26 96 42
24 0.827303 32.756679 25.306906 82 0 114 19
25 0.812865 21.023796 18.813840 86 214 105 232
26 0.770362 35.468296 29.446156 88 157 114 191
27 0.800000 20.880613 14.969641 95 245 115 256
28 0.819222 24.351591 21.349936 102 92 121 115
29 0.755556 30.594117 26.318099 108 25 138 49
30 0.552778 44.721360 27.570347 110 113 150 140
31 0.754167 17.088007 15.180789 115 56 131 71
32 0.619704 38.587563 28.299606 115 151 150 180
33 0.790451 29.832868 27.547246 115 210 141 239
34 0.833333 3.162278 2.523133 125 252 127 255
35 0.821875 21.189620 18.299235 130 0 150 16
36 0.665926 52.201533 33.832563 137 75 187 102
37 0.689855 34.985711 24.618327 137 233 167 256
38 0.766447 19.924859 17.223960 138 59 157 75
39 0.788462 17.464249 14.450304 149 182 165 195
40 0.778656 26.627054 22.397687 153 125 175 148
41 0.778409 26.925824 22.875783 157 5 181 27
42 0.815972 19.313208 17.297725 158 206 176 222
43 0.789474 25.495098 21.850969 160 35 185 54
44 0.807407 32.310989 28.856519 166 167 196 194
45 0.776860 24.596748 21.880084 174 222 196 244
46 0.795330 29.154759 27.151532 185 116 211 144
47 0.800000 16.031220 9.027033 193 251 209 256
48 0.774038 17.088007 14.317527 195 200 211 213
49 0.692424 34.205263 24.121991 199 17 232 37
50 0.718391 32.756679 28.209479 199 89 229 118
51 0.775362 30.594117 26.099486 203 51 233 74
52 0.745455 25.179357 16.155931 214 0 239 11
53 0.771978 28.861739 26.749965 214 220 240 248
54 0.782407 40.792156 32.800723 217 161 244 201
55 0.748663 22.360680 18.881395 223 130 245 147
56 0.781481 18.973666 16.390654 232 39 250 54
57 0.722222 18.027756 9.965575 248 170 254 188
58 0.781818 22.000000 10.464158 249 117 254 139
59 0.728571 14.035669 8.058239 249 228 254 242
60 0.766667 15.033296 7.653040 250 67 254 82

To clarify the difference between label-location and integer-location again, let’s try both approaches again on a sorted dataset:

df_sorted = df.sort_values(by='area', ascending=False)
df_sorted
area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
36 899 198.291435 21.753901 54.500296 0.916885 0.665926 52.201533 33.832563 137 75 187 102
54 845 198.295858 26.478859 40.977288 0.763182 0.782407 40.792156 32.800723 217 161 244 201
26 681 198.308370 24.756299 36.310074 0.731536 0.770362 35.468296 29.446156 88 157 114 191
2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
44 654 199.706422 27.176616 30.914646 0.476664 0.807407 32.310989 28.856519 166 167 196 194
... ... ... ... ... ... ... ... ... ... ... ... ...
47 64 190.250000 5.354655 15.925594 0.941780 0.800000 16.031220 9.027033 193 251 209 256
59 51 190.431373 5.032414 13.742079 0.930534 0.728571 14.035669 8.058239 249 228 254 242
60 46 175.304348 3.803982 15.948714 0.971139 0.766667 15.033296 7.653040 250 67 254 82
9 19 155.368421 1.967201 11.358057 0.984887 0.863636 11.000000 4.918491 21 0 32 2
34 5 161.600000 1.788854 3.098387 0.816497 0.833333 3.162278 2.523133 125 252 127 255

61 rows × 12 columns

As we can see above, the index labels are not in order anymore. Let’s apply loc[] and iloc[] again, to get the “first three rows”:

# label-location
df_sorted.loc[[0,1,2]]
area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
# integer-location
df_sorted.iloc[0:3]
area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
36 899 198.291435 21.753901 54.500296 0.916885 0.665926 52.201533 33.832563 137 75 187 102
54 845 198.295858 26.478859 40.977288 0.763182 0.782407 40.792156 32.800723 217 161 244 201
26 681 198.308370 24.756299 36.310074 0.731536 0.770362 35.468296 29.446156 88 157 114 191

Boolean indexing#

With boolean indexing (or masking), we can select specific parts of the data based on conditional / logical queries. Suppose we want to get entries for which “area” > 500, we can use this condition in the [] operator.

mask = df["area"] > 500
mask
0     False
1     False
2      True
3     False
4     False
      ...  
56    False
57    False
58    False
59    False
60    False
Name: area, Length: 61, dtype: bool
# apply the mask to the DataFrame: df[mask]
# Or as one-liner:
df[df["area"] > 500]
area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
11 630 173.600000 24.460212 33.020881 0.671781 0.725806 33.837849 28.322092 26 59 57 87
15 516 194.403101 23.585468 27.907939 0.534582 0.767857 28.861739 25.631847 44 17 72 41
24 503 198.648111 19.812612 33.323794 0.804060 0.827303 32.756679 25.306906 82 0 114 19
26 681 198.308370 24.756299 36.310074 0.731536 0.770362 35.468296 29.446156 88 157 114 191
29 544 198.455882 23.888675 29.100741 0.571078 0.755556 30.594117 26.318099 108 25 138 49
30 597 190.954774 17.078429 47.451008 0.932984 0.552778 44.721360 27.570347 110 113 150 140
32 629 193.666137 20.719000 40.268495 0.857478 0.619704 38.587563 28.299606 115 151 150 180
33 596 210.067114 26.320789 28.931798 0.415150 0.790451 29.832868 27.547246 115 210 141 239
36 899 198.291435 21.753901 54.500296 0.916885 0.665926 52.201533 33.832563 137 75 187 102
44 654 199.706422 27.176616 30.914646 0.476664 0.807407 32.310989 28.856519 166 167 196 194
46 579 200.649396 26.484866 27.884001 0.312788 0.795330 29.154759 27.151532 185 116 211 144
50 625 217.894400 24.624993 32.379307 0.649319 0.718391 32.756679 28.209479 199 89 229 118
51 535 189.936449 22.795105 29.898229 0.647079 0.775362 30.594117 26.099486 203 51 233 74
53 562 215.928826 26.012538 27.607523 0.334977 0.771978 28.861739 26.749965 214 220 240 248
54 845 198.295858 26.478859 40.977288 0.763182 0.782407 40.792156 32.800723 217 161 244 201

We can also use this syntax with loc[] to return specific columns only, e.g., “area” and “mean_intensity”.

df.loc[(df["area"] > 500), ['area', 'mean_intensity']]
area mean_intensity
2 661 205.216339
11 630 173.600000
15 516 194.403101
24 503 198.648111
26 681 198.308370
29 544 198.455882
30 597 190.954774
32 629 193.666137
33 596 210.067114
36 899 198.291435
44 654 199.706422
46 579 200.649396
50 625 217.894400
51 535 189.936449
53 562 215.928826
54 845 198.295858

Get scalar values#

If we only want to access a single scalar value, the fastest way is to use the at and iat methods.

df.at[2, "area"]
661
df.iat[0, 0]
422

Math with DataFrames#

Now that we know how to select specific parts of a DataFrame, we want to do some math with it. Mathematical operations can be executed directly on a DataFrame or selected parts of it. We can also easily add new columns to the DataFrame to store computed results.

Let’s assume we want to compute the ratio of the two axes by dividing the major axis by the minor axis and store this result in a new colum named axis_ratio. We can just define that new column and use the basic mathematical operator on the existing columns to fill it:

df['axis_ratio'] = df['major_axis_length'] / df['minor_axis_length']
df['axis_ratio']
0     2.096412
1     1.772543
2     1.063321
3     1.063176
4     1.565269
        ...   
56    1.273114
57    2.916048
58    3.917806
59    2.730713
60    4.192637
Name: axis_ratio, Length: 61, dtype: float64

A DataFrame and the underlying Series objects holding the column data also provide lots of built-in methods for mathematical operations and basic statistics via the . operator.

See also:

# Compute the mean for all columns in the DataFrame
df.mean()
area                        358.426230
mean_intensity              191.966564
minor_axis_length            17.127032
major_axis_length            24.796851
eccentricity                  0.657902
extent                        0.764751
feret_diameter_max           25.323368
equivalent_diameter_area     20.201434
bbox-0                      112.786885
bbox-1                      122.245902
bbox-2                      134.901639
bbox-3                      142.131148
axis_ratio                    1.650854
dtype: float64
# Get the max value from a column
df['area'].max()
899

Exercise#

From the loaded CSV file, create a table that only contains these columns:

  • minor_axis_length

  • major_axis_length

  • mean_intensity

Also, select all rows where value mean_intensity is smaller than the mean of the column mean_intensity.

df_loaded = pd.read_csv('../data/blobs_statistics.csv', index_col=0)