Selecting data from DataFrames#
As with Python sequences such as lists, we often want to access specific parts of the data in a DataFrame. And as with Python sequences, we can use indexing, slicing, and masking on DataFrames, too.
We start by reading a CSV file into a DataFrame again:
import pandas as pd
df = pd.read_csv('data/measurements_1.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 10 non-null object
1 value1 10 non-null float64
2 value2 8 non-null float64
3 value3 9 non-null float64
4 valid 10 non-null bool
dtypes: bool(1), float64(3), object(1)
memory usage: 458.0+ bytes
Pandas provides methods to display just the first n rows or the last n rows of a DataFrame: head() and tail().
df.head(3)
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
0 | A | 9.0 | 2.0 | 3.1 | True |
1 | B | 11.0 | NaN | 0.4 | False |
2 | C | 10.2 | 8.0 | 5.6 | True |
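Analogously, tail() returns the last rows. A minimal self-contained sketch (the frame is rebuilt inline here instead of reading the CSV, with just the label and value1 columns):

```python
import pandas as pd

# Inline stand-in for pd.read_csv('data/measurements_1.csv')
df = pd.DataFrame({
    'label': list('ABCDEFGHIJ'),
    'value1': [9.0, 11.0, 10.2, 9.5, 15.0, 400.0, 9.0, 11.0, 11.3, 10.0],
})

# tail(n) returns the last n rows (n defaults to 5)
last_two = df.tail(2)
print(last_two)
```

Note that the original index labels (here 8 and 9) are kept in the result.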
We can also get the column names via the columns attribute. This returns an Index object with the column names and some metadata.
df.columns
Index(['label', 'value1', 'value2', 'value3', 'valid'], dtype='object')
Selecting rows and columns#
Let’s get the DataFrame’s first row:
#df[0]
Oops… this raises a KeyError. For DataFrames, the [] operator expects labels only, and it works on the columns, not the rows. We can use it to get a subset of the DataFrame with the specified column labels:
df['value1']
0 9.0
1 11.0
2 10.2
3 9.5
4 15.0
5 400.0
6 9.0
7 11.0
8 11.3
9 10.0
Name: value1, dtype: float64
Notice that the resulting data structure has the column data and an index, too. That’s because Pandas DataFrames use Pandas Series for the columns, which are one-dimensional arrays with an index for the rows and a label for the column data. It can be seen as a single-column table.
type(df['value1'])
pandas.core.series.Series
We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable.
df_subset = df[['label', 'value1']]
df_subset
label | value1 | |
---|---|---|
0 | A | 9.0 |
1 | B | 11.0 |
2 | C | 10.2 |
3 | D | 9.5 |
4 | E | 15.0 |
5 | F | 400.0 |
6 | G | 9.0 |
7 | H | 11.0 |
8 | I | 11.3 |
9 | J | 10.0 |
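A subtle point worth noting: passing a single column name as a string returns a Series, while passing a list of names returns a DataFrame, even if the list contains only one name. A small sketch with inline data:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B'], 'value1': [9.0, 11.0]})

s = df['value1']      # string key    -> Series
sub = df[['value1']]  # list of keys  -> DataFrame, even with one column
print(type(s).__name__, type(sub).__name__)
```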
Label-location based selection#
If we want to select specific rows, we need to use the loc[] indexer. Note that loc[] also works with labels only. Let’s get the row with label 0 (its index label).
df.loc[0]
label A
value1 9.0
value2 2.0
value3 3.1
valid True
Name: 0, dtype: object
We can also select more than one row by providing the row labels as a list. In addition, we can select specific columns, too:
df.loc[[0,2], ['label', 'value1']]
label | value1 | |
---|---|---|
0 | A | 9.0 |
2 | C | 10.2 |
We can also use slicing via [from:to] to get data starting from a specific label up to a specific label. Note that, unlike Python slicing, loc[] slices include the end label:
df.loc[0:2, 'label':'value2']
label | value1 | value2 | |
---|---|---|---|
0 | A | 9.0 | 2.0 |
1 | B | 11.0 | NaN |
2 | C | 10.2 | 8.0 |
Attention: Slicing on label-location requires knowing the order of the labels in the row and column index. A slice whose start label comes after its end label returns an empty selection:
df.loc[0:2, 'value2':'label']
0 |
---|
1 |
2 |
Integer-location based selection#
If we want to select data based on integer positions in the index, we need to use iloc[]. Let’s get the value from the first row and the first column:
df.iloc[0, 0]
'A'
We can use slicing to get specific parts of the data. Let’s get the first 5 rows and the first 3 columns:
df.iloc[0:5, 0:3]
label | value1 | value2 | |
---|---|---|---|
0 | A | 9.0 | 2.0 |
1 | B | 11.0 | NaN |
2 | C | 10.2 | 8.0 |
3 | D | 9.5 | 5.0 |
4 | E | 15.0 | 7.0 |
Let’s get everything starting from the 5th row and starting from the 2nd column:
df.iloc[4:, 1:]
value1 | value2 | value3 | valid | |
---|---|---|---|---|
4 | 15.0 | 7.0 | 4.4 | False |
5 | 400.0 | NaN | 2.2 | True |
6 | 9.0 | 1.0 | 1.1 | False |
7 | 11.0 | 4.0 | 0.9 | True |
8 | 11.3 | 6.0 | 3.3 | True |
9 | 10.0 | 9.0 | 4.1 | False |
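iloc[] also supports the same integer tricks as Python lists, e.g., negative positions and step sizes. A self-contained sketch on a small inline frame:

```python
import pandas as pd

df = pd.DataFrame({
    'label': list('ABCDE'),
    'value1': [9.0, 11.0, 10.2, 9.5, 15.0],
})

last_row = df.iloc[-1]      # last row, like a list's [-1]
every_other = df.iloc[::2]  # every second row
print(last_row['label'], list(every_other['label']))
```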
To clarify the difference between label-location and integer-location, let’s try both approaches on a sorted dataset:
df_sorted = df.sort_values(by='value1', ascending=False)
df_sorted
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
5 | F | 400.0 | NaN | 2.2 | True |
4 | E | 15.0 | 7.0 | 4.4 | False |
8 | I | 11.3 | 6.0 | 3.3 | True |
1 | B | 11.0 | NaN | 0.4 | False |
7 | H | 11.0 | 4.0 | 0.9 | True |
2 | C | 10.2 | 8.0 | 5.6 | True |
9 | J | 10.0 | 9.0 | 4.1 | False |
3 | D | 9.5 | 5.0 | NaN | True |
0 | A | 9.0 | 2.0 | 3.1 | True |
6 | G | 9.0 | 1.0 | 1.1 | False |
As we can see above, the index labels are no longer in order. Let’s apply loc[] and iloc[] again to get the “first three rows”:
# label-location
df_sorted.loc[[0,1,2]]
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
0 | A | 9.0 | 2.0 | 3.1 | True |
1 | B | 11.0 | NaN | 0.4 | False |
2 | C | 10.2 | 8.0 | 5.6 | True |
# integer-location
df_sorted.iloc[0:3]
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
5 | F | 400.0 | NaN | 2.2 | True |
4 | E | 15.0 | 7.0 | 4.4 | False |
8 | I | 11.3 | 6.0 | 3.3 | True |
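If you want the integer labels to match the new row order after sorting, reset_index(drop=True) rebuilds the index, so loc[] and iloc[] agree again. A sketch on a three-row inline frame:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C'], 'value1': [9.0, 11.0, 10.2]})
df_sorted = df.sort_values(by='value1', ascending=False)

# drop=True discards the old labels instead of keeping them as a column
df_reset = df_sorted.reset_index(drop=True)
print(df_reset)
```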
Boolean indexing#
With boolean indexing (or masking), we can select specific subsets of the data based on conditional/logical queries. Suppose we want to get entries for which “value1” > 40; we can use this condition in the [] operator.
mask = df["value1"] > 40
mask
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
Name: value1, dtype: bool
# apply the mask to the DataFrame: df[mask]
# Or as one-liner:
df[df["value1"] > 40]
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
5 | F | 400.0 | NaN | 2.2 | True |
We can also use this syntax with loc[] to return specific columns only, e.g., “value1” and “valid”.
df.loc[(df["value1"] > 40), ['value1', 'valid']]
value1 | valid | |
---|---|---|
5 | 400.0 | True |
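Conditions can also be combined with the element-wise operators & (and), | (or), and ~ (not). Each condition needs its own parentheses, because & and | bind tighter than comparison operators. A minimal sketch with inline data:

```python
import pandas as pd

df = pd.DataFrame({
    'label': ['A', 'B', 'C', 'D'],
    'value1': [9.0, 11.0, 10.2, 9.5],
    'valid': [True, False, True, True],
})

# parentheses around the comparison are required
subset = df[(df['value1'] > 9.5) & df['valid']]
print(subset)
```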
Get scalar values#
If we only want to access a single scalar value, the fastest way is to use the at[] and iat[] accessors.
df.at[2, "value1"]
np.float64(10.2)
df.iat[0, 4]
np.True_
Math with DataFrames#
Now that we know how to select specific parts of a DataFrame, we want to do some math with it. Mathematical operations can be executed directly on a DataFrame or on selected parts of it. We can also easily add new columns to the DataFrame to store computed results.
Let’s assume we want to compute the mean of each row’s values and store the result in a new column named value_mean. We can simply define that new column and use the basic mathematical operators on the existing columns to fill it:
df['value_mean'] = (df['value1'] + df['value2'] + df['value3']) / 3
df
label | value1 | value2 | value3 | valid | value_mean | |
---|---|---|---|---|---|---|
0 | A | 9.0 | 2.0 | 3.1 | True | 4.700000 |
1 | B | 11.0 | NaN | 0.4 | False | NaN |
2 | C | 10.2 | 8.0 | 5.6 | True | 7.933333 |
3 | D | 9.5 | 5.0 | NaN | True | NaN |
4 | E | 15.0 | 7.0 | 4.4 | False | 8.800000 |
5 | F | 400.0 | NaN | 2.2 | True | NaN |
6 | G | 9.0 | 1.0 | 1.1 | False | 3.700000 |
7 | H | 11.0 | 4.0 | 0.9 | True | 5.300000 |
8 | I | 11.3 | 6.0 | 3.3 | True | 6.866667 |
9 | J | 10.0 | 9.0 | 4.1 | False | 7.700000 |
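The NaN results above arise because arithmetic with NaN yields NaN. If the intent is to average over the values that are present, an alternative is the mean(axis=1) method, which skips NaN by default. A self-contained sketch on two of the rows:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'value1': [9.0, 11.0],
    'value2': [2.0, np.nan],
    'value3': [3.1, 0.4],
})

# axis=1 averages across each row; skipna=True (the default) ignores NaN,
# so the second row is averaged over the two non-missing values
row_mean = df[['value1', 'value2', 'value3']].mean(axis=1)
print(row_mean)
```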
A DataFrame and the underlying Series objects holding the column data also provide many built-in methods for mathematical operations and basic statistics, accessed via the . operator.
# Compute the mean for all numerical columns in the DataFrame
df.mean(numeric_only=True)
value1 49.600000
value2 5.250000
value3 2.788889
valid 0.600000
value_mean 6.428571
dtype: float64
# Get the max value from a column
df['value1'].max()
np.float64(400.0)
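A related convenience method is idxmax(), which returns the index label at which the maximum occurs; the label can then be fed back into loc[]. A small sketch with inline data:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C'], 'value1': [9.0, 400.0, 10.2]})

# idxmax() gives the row label where the maximum value occurs
pos = df['value1'].idxmax()
print(pos, df.loc[pos, 'label'])
```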
Exercise#
From our DataFrame, select all rows where the value in value2 is smaller than the mean of column value3 and where we know that our measurement is valid.