Selecting data from DataFrames

Selecting data from DataFrames#

Like with Python sequences, we may want to get specific parts of the data in a DataFrame. And like with Python sequences, we can use indexing, slicing, and masking on DataFrames, too.

Selecting rows and columns#

Let’s get the DataFrames first row

df[0]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/Documents/Miniconda/envs/devbio-napari-env/lib/python3.9/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[5], line 1
----> 1 df[0]

File ~/Documents/Miniconda/envs/devbio-napari-env/lib/python3.9/site-packages/pandas/core/frame.py:4090, in DataFrame.__getitem__(self, key)
   4088 if self.columns.nlevels > 1:
   4089     return self._getitem_multilevel(key)
-> 4090 indexer = self.columns.get_loc(key)
   4091 if is_integer(indexer):
   4092     indexer = [indexer]

File ~/Documents/Miniconda/envs/devbio-napari-env/lib/python3.9/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 0

Ooops…this raised a KeyError. For DataFrames, the [] operator expects labels only, and works on the columns, not the rows. We can use it to get a subset of the DataFrame with the specified column labels:

df['area']

   422
   182
   661
   437
   476
     ... 
  211
   78
   86
   51
   46
Name: area, Length: 61, dtype: int64

Notice that the resulting data structure has the column data and an index, too. That’s because Pandas DataFrames use Pandas Series for the columns, which are one-dimensional arrays with an index for the rows and a label for the column data. It can be seen as a single-column table.

We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable.

df_subset = df[['area', 'mean_intensity']]
df_subset

	area	mean_intensity
0	422	192.379147
1	182	180.131868
2	661	205.216339
3	437	216.585812
4	476	212.302521
...	...	...
56	211	185.061611
57	78	185.230769
58	86	183.720930
59	51	190.431373
60	46	175.304348

61 rows × 2 columns

Label-location based selection#

If we want to select specific rows, we need to use the method loc[]. Note, that loc[] also works with labels only. Let’s get the row with label 0 (its index label).

df.loc[0]

area                        422.000000
mean_intensity              192.379147
minor_axis_length            16.488550
major_axis_length            34.566789
eccentricity                  0.878900
extent                        0.586111
feret_diameter_max           35.227830
equivalent_diameter_area     23.179885
bbox-0                        0.000000
bbox-1                       11.000000
bbox-2                       30.000000
bbox-3                       35.000000
Name: 0, dtype: float64

We can also select more than one row by providing the row labels as a list. In addition, we can select specific columns, too:

df.loc[[0,2], ['area', 'mean_intensity']]

	area	mean_intensity
0	422	192.379147
2	661	205.216339

We can also use slicing via [from:to] to get data starting from a specific label up to a specific label.

df.loc[0:2, 'area':'extent']

	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent
0	422	192.379147	16.488550	34.566789	0.878900	0.586111
1	182	180.131868	11.736074	20.802697	0.825665	0.787879
2	661	205.216339	28.409502	30.208433	0.339934	0.874339

Attention: Slicing on label-location requires that the order of the labels in the row and column index is known and taken into account:

df.loc[0:2, 'extent':'area']


0
1
2

Integer-location based selection#

If we want to select data based on the indices’ integer positions, we need to use iloc[]. Let’s get the value from the first row and the first column:

df.iloc[0, 0]

We can use slicing to get specific parts of the data. Let’s get the first 10 rows and the first 5 columns:

df.iloc[0:10, 0:5]

	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity
0	422	192.379147	16.488550	34.566789	0.878900
1	182	180.131868	11.736074	20.802697	0.825665
2	661	205.216339	28.409502	30.208433	0.339934
3	437	216.585812	23.143996	24.606130	0.339576
4	476	212.302521	19.852882	31.075106	0.769317
5	277	206.469314	17.523565	20.151370	0.493763
6	259	178.007722	15.309080	21.761302	0.710695
7	219	191.598174	15.745458	17.729017	0.459616
8	67	167.522388	8.731268	9.761752	0.447195
9	19	155.368421	1.967201	11.358057	0.984887

Let’s get everything starting from the 10th row and starting from the 5th column:

df.iloc[10:, 5:]

	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
10	0.747692	26.925824	24.875579	21	162	47	187
11	0.725806	33.837849	28.322092	26	59	57	87
12	0.767361	18.973666	16.774562	39	225	57	241
13	0.787879	11.704700	9.965575	40	4	51	13
14	0.784965	26.925824	23.909926	42	128	68	150
15	0.767857	28.861739	25.631847	44	17	72	41
16	0.599078	34.014703	22.283703	44	184	75	205
17	0.775926	28.160256	23.097346	60	89	87	109
18	0.747899	22.472205	18.437868	60	206	81	223
19	0.800454	24.186773	21.200320	63	235	84	256
20	0.774359	16.155494	13.865755	66	162	81	175
21	0.790514	24.698178	22.567583	73	53	96	75
22	0.676190	31.953091	23.289484	75	117	96	147
23	0.809211	19.924859	17.697936	77	26	96	42
24	0.827303	32.756679	25.306906	82	0	114	19
25	0.812865	21.023796	18.813840	86	214	105	232
26	0.770362	35.468296	29.446156	88	157	114	191
27	0.800000	20.880613	14.969641	95	245	115	256
28	0.819222	24.351591	21.349936	102	92	121	115
29	0.755556	30.594117	26.318099	108	25	138	49
30	0.552778	44.721360	27.570347	110	113	150	140
31	0.754167	17.088007	15.180789	115	56	131	71
32	0.619704	38.587563	28.299606	115	151	150	180
33	0.790451	29.832868	27.547246	115	210	141	239
34	0.833333	3.162278	2.523133	125	252	127	255
35	0.821875	21.189620	18.299235	130	0	150	16
36	0.665926	52.201533	33.832563	137	75	187	102
37	0.689855	34.985711	24.618327	137	233	167	256
38	0.766447	19.924859	17.223960	138	59	157	75
39	0.788462	17.464249	14.450304	149	182	165	195
40	0.778656	26.627054	22.397687	153	125	175	148
41	0.778409	26.925824	22.875783	157	5	181	27
42	0.815972	19.313208	17.297725	158	206	176	222
43	0.789474	25.495098	21.850969	160	35	185	54
44	0.807407	32.310989	28.856519	166	167	196	194
45	0.776860	24.596748	21.880084	174	222	196	244
46	0.795330	29.154759	27.151532	185	116	211	144
47	0.800000	16.031220	9.027033	193	251	209	256
48	0.774038	17.088007	14.317527	195	200	211	213
49	0.692424	34.205263	24.121991	199	17	232	37
50	0.718391	32.756679	28.209479	199	89	229	118
51	0.775362	30.594117	26.099486	203	51	233	74
52	0.745455	25.179357	16.155931	214	0	239	11
53	0.771978	28.861739	26.749965	214	220	240	248
54	0.782407	40.792156	32.800723	217	161	244	201
55	0.748663	22.360680	18.881395	223	130	245	147
56	0.781481	18.973666	16.390654	232	39	250	54
57	0.722222	18.027756	9.965575	248	170	254	188
58	0.781818	22.000000	10.464158	249	117	254	139
59	0.728571	14.035669	8.058239	249	228	254	242
60	0.766667	15.033296	7.653040	250	67	254	82

To clarify the difference between label-location and integer-location again, let’s try both approaches again on a sorted dataset:

df_sorted = df.sort_values(by='area', ascending=False)
df_sorted

	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
36	899	198.291435	21.753901	54.500296	0.916885	0.665926	52.201533	33.832563	137	75	187	102
54	845	198.295858	26.478859	40.977288	0.763182	0.782407	40.792156	32.800723	217	161	244	201
26	681	198.308370	24.756299	36.310074	0.731536	0.770362	35.468296	29.446156	88	157	114	191
2	661	205.216339	28.409502	30.208433	0.339934	0.874339	32.756679	29.010538	0	95	28	122
44	654	199.706422	27.176616	30.914646	0.476664	0.807407	32.310989	28.856519	166	167	196	194
...	...	...	...	...	...	...	...	...	...	...	...	...
47	64	190.250000	5.354655	15.925594	0.941780	0.800000	16.031220	9.027033	193	251	209	256
59	51	190.431373	5.032414	13.742079	0.930534	0.728571	14.035669	8.058239	249	228	254	242
60	46	175.304348	3.803982	15.948714	0.971139	0.766667	15.033296	7.653040	250	67	254	82
9	19	155.368421	1.967201	11.358057	0.984887	0.863636	11.000000	4.918491	21	0	32	2
34	5	161.600000	1.788854	3.098387	0.816497	0.833333	3.162278	2.523133	125	252	127	255

61 rows × 12 columns

As we can see above, the index labels are not in order anymore. Let’s apply loc[] and iloc[] again, to get the “first three rows”:

# label-location
df_sorted.loc[[0,1,2]]

	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-1	bbox-2	bbox-3
0	422	192.379147	16.488550	34.566789	0.878900	0.586111	35.227830	23.179885	11	30	35
1	182	180.131868	11.736074	20.802697	0.825665	0.787879	21.377558	15.222667	53	11	74
2	661	205.216339	28.409502	30.208433	0.339934	0.874339	32.756679	29.010538	95	28	122

# integer-location
df_sorted.iloc[0:3]

	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
36	899	198.291435	21.753901	54.500296	0.916885	0.665926	52.201533	33.832563	137	75	187	102
54	845	198.295858	26.478859	40.977288	0.763182	0.782407	40.792156	32.800723	217	161	244	201
26	681	198.308370	24.756299	36.310074	0.731536	0.770362	35.468296	29.446156	88	157	114	191

Boolean indexing#

With boolean indexing (or masking), we can select specific parts of the data based on conditional / logical queries. Suppose we want to get entries for which “area” > 500, we can use this condition in the [] operator.

mask = df["area"] > 500
mask

   False
   False
    True
   False
   False
      ...  
  False
  False
  False
  False
  False
Name: area, Length: 61, dtype: bool

# apply the mask to the DataFrame: df[mask]
# Or as one-liner:
df[df["area"] > 500]

	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
2	661	205.216339	28.409502	30.208433	0.339934	0.874339	32.756679	29.010538	0	95	28	122
11	630	173.600000	24.460212	33.020881	0.671781	0.725806	33.837849	28.322092	26	59	57	87
15	516	194.403101	23.585468	27.907939	0.534582	0.767857	28.861739	25.631847	44	17	72	41
24	503	198.648111	19.812612	33.323794	0.804060	0.827303	32.756679	25.306906	82	0	114	19
26	681	198.308370	24.756299	36.310074	0.731536	0.770362	35.468296	29.446156	88	157	114	191
29	544	198.455882	23.888675	29.100741	0.571078	0.755556	30.594117	26.318099	108	25	138	49
30	597	190.954774	17.078429	47.451008	0.932984	0.552778	44.721360	27.570347	110	113	150	140
32	629	193.666137	20.719000	40.268495	0.857478	0.619704	38.587563	28.299606	115	151	150	180
33	596	210.067114	26.320789	28.931798	0.415150	0.790451	29.832868	27.547246	115	210	141	239
36	899	198.291435	21.753901	54.500296	0.916885	0.665926	52.201533	33.832563	137	75	187	102
44	654	199.706422	27.176616	30.914646	0.476664	0.807407	32.310989	28.856519	166	167	196	194
46	579	200.649396	26.484866	27.884001	0.312788	0.795330	29.154759	27.151532	185	116	211	144
50	625	217.894400	24.624993	32.379307	0.649319	0.718391	32.756679	28.209479	199	89	229	118
51	535	189.936449	22.795105	29.898229	0.647079	0.775362	30.594117	26.099486	203	51	233	74
53	562	215.928826	26.012538	27.607523	0.334977	0.771978	28.861739	26.749965	214	220	240	248
54	845	198.295858	26.478859	40.977288	0.763182	0.782407	40.792156	32.800723	217	161	244	201

We can also use this syntax with loc[] to return specific columns only, e.g., “area” and “mean_intensity”.

df.loc[(df["area"] > 500), ['area', 'mean_intensity']]

	area	mean_intensity
2	661	205.216339
11	630	173.600000
15	516	194.403101
24	503	198.648111
26	681	198.308370
29	544	198.455882
30	597	190.954774
32	629	193.666137
33	596	210.067114
36	899	198.291435
44	654	199.706422
46	579	200.649396
50	625	217.894400
51	535	189.936449
53	562	215.928826
54	845	198.295858

Get scalar values#

If we only want to access a single scalar value, the fastest way is to use the at and iat methods.

df.at[2, "area"]

df.iat[0, 0]

Math with DataFrames#

Now that we know how to select specific parts of a DataFrame, we want to do some math with it. Mathematical operations can be executed directly on a DataFrame or selected parts of it. We can also easily add new columns to the DataFrame to store computed results.

Let’s assume we want to compute the ratio of the two axes by dividing the major axis by the minor axis and store this result in a new colum named axis_ratio. We can just define that new column and use the basic mathematical operator on the existing columns to fill it:

df['axis_ratio'] = df['major_axis_length'] / df['minor_axis_length']
df['axis_ratio']

   2.096412
   1.772543
   1.063321
   1.063176
   1.565269
        ...   
  1.273114
  2.916048
  3.917806
  2.730713
  4.192637
Name: axis_ratio, Length: 61, dtype: float64

A DataFrame and the underlying Series objects holding the column data also provide lots of built-in methods for mathematical operations and basic statistics via the . operator.

Exercise#

From the loaded CSV file, create a table that only contains these columns:

minor_axis_length
major_axis_length
mean_intensity

Also, select all rows where value mean_intensity is smaller than the mean of the column mean_intensity.

df_loaded = pd.read_csv('../data/blobs_statistics.csv', index_col=0)