Selecting data from DataFrames#
As with Python sequences such as lists, we often want to access specific parts of the data in a DataFrame. And as with Python sequences, we can use indexing, slicing, and masking on DataFrames, too.
We start by reading a CSV file into a DataFrame again:
import pandas as pd
df = pd.read_csv('data/measurements_1.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 10 non-null object
1 value1 10 non-null float64
2 value2 8 non-null float64
3 value3 9 non-null float64
4 valid 10 non-null bool
dtypes: bool(1), float64(3), object(1)
memory usage: 458.0+ bytes
Pandas provides methods to display just the first n rows or the last n rows of a DataFrame: head() and tail().
df.head(3)
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
0 | A | 9.0 | 2.0 | 3.1 | True |
1 | B | 11.0 | NaN | 0.4 | False |
2 | C | 10.2 | 8.0 | 5.6 | True |
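Analogously, tail() returns the last rows. A minimal self-contained sketch (the frame is rebuilt inline here instead of reading the CSV, with just the label and value1 columns):

```python
import pandas as pd

# Inline stand-in for pd.read_csv('data/measurements_1.csv')
df = pd.DataFrame({
    'label': list('ABCDEFGHIJ'),
    'value1': [9.0, 11.0, 10.2, 9.5, 15.0, 400.0, 9.0, 11.0, 11.3, 10.0],
})

# tail(n) returns the last n rows (n defaults to 5)
last_two = df.tail(2)
print(last_two)
```

Note that the original index labels (here 8 and 9) are kept in the result.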
We can also get the column names via the columns attribute. This returns an Index object with the column names and some metadata.
df.columns
Index(['label', 'value1', 'value2', 'value3', 'valid'], dtype='object')
Selecting rows and columns#
Let’s get the DataFrame’s first row:
#df[0]
Oops… this raises a KeyError. For DataFrames, the [] operator expects labels only, and it works on the columns, not the rows. We can use it to get a subset of the DataFrame with the specified column labels:
df['value1']
0 9.0
1 11.0
2 10.2
3 9.5
4 15.0
5 400.0
6 9.0
7 11.0
8 11.3
9 10.0
Name: value1, dtype: float64
Notice that the resulting data structure has the column data and an index, too. That’s because Pandas DataFrames use Pandas Series for the columns, which are one-dimensional arrays with an index for the rows and a label for the column data. It can be seen as a single-column table.
type(df['value1'])
pandas.core.series.Series
We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable.
df_subset = df[['label', 'value1']]
df_subset
label | value1 | |
---|---|---|
0 | A | 9.0 |
1 | B | 11.0 |
2 | C | 10.2 |
3 | D | 9.5 |
4 | E | 15.0 |
5 | F | 400.0 |
6 | G | 9.0 |
7 | H | 11.0 |
8 | I | 11.3 |
9 | J | 10.0 |
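A subtle point worth noting: passing a single column name as a string returns a Series, while passing a list of names returns a DataFrame, even if the list contains only one name. A small sketch with inline data:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B'], 'value1': [9.0, 11.0]})

s = df['value1']      # string key    -> Series
sub = df[['value1']]  # list of keys  -> DataFrame, even with one column
print(type(s).__name__, type(sub).__name__)
```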
Label-location based selection#
If we want to select specific rows, we need to use the loc[] indexer. Note that loc[] also works with labels only. Let’s get the row with label 0 (its index label).
df.loc[0]
label A
value1 9.0
value2 2.0
value3 3.1
valid True
Name: 0, dtype: object
We can also select more than one row by providing the row labels as a list. In addition, we can select specific columns, too:
df.loc[[0,2], ['label', 'value1']]
label | value1 | |
---|---|---|
0 | A | 9.0 |
2 | C | 10.2 |
We can also use slicing via [from:to] to get data starting from a specific label up to a specific label. Note that, unlike Python slicing, loc[] slices include the end label:
df.loc[0:2, 'label':'value2']
label | value1 | value2 | |
---|---|---|---|
0 | A | 9.0 | 2.0 |
1 | B | 11.0 | NaN |
2 | C | 10.2 | 8.0 |
Attention: Slicing on label-location requires knowing the order of the labels in the row and column index. A slice whose start label comes after its end label returns an empty selection:
df.loc[0:2, 'value2':'label']
0 |
---|
1 |
2 |
Integer-location based selection#
If we want to select data based on integer positions in the index, we need to use iloc[]. Let’s get the value from the first row and the first column:
df.iloc[0, 0]
'A'
We can use slicing to get specific parts of the data. Let’s get the first 5 rows and the first 3 columns:
df.iloc[0:5, 0:3]
label | value1 | value2 | |
---|---|---|---|
0 | A | 9.0 | 2.0 |
1 | B | 11.0 | NaN |
2 | C | 10.2 | 8.0 |
3 | D | 9.5 | 5.0 |
4 | E | 15.0 | 7.0 |
Let’s get everything starting from the 5th row and starting from the 2nd column:
df.iloc[4:, 1:]
value1 | value2 | value3 | valid | |
---|---|---|---|---|
4 | 15.0 | 7.0 | 4.4 | False |
5 | 400.0 | NaN | 2.2 | True |
6 | 9.0 | 1.0 | 1.1 | False |
7 | 11.0 | 4.0 | 0.9 | True |
8 | 11.3 | 6.0 | 3.3 | True |
9 | 10.0 | 9.0 | 4.1 | False |
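iloc[] also supports the same integer tricks as Python lists, e.g., negative positions and step sizes. A self-contained sketch on a small inline frame:

```python
import pandas as pd

df = pd.DataFrame({
    'label': list('ABCDE'),
    'value1': [9.0, 11.0, 10.2, 9.5, 15.0],
})

last_row = df.iloc[-1]      # last row, like a list's [-1]
every_other = df.iloc[::2]  # every second row
print(last_row['label'], list(every_other['label']))
```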
To clarify the difference between label-location and integer-location, let’s try both approaches on a sorted dataset:
df_sorted = df.sort_values(by='value1', ascending=False)
df_sorted
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
5 | F | 400.0 | NaN | 2.2 | True |
4 | E | 15.0 | 7.0 | 4.4 | False |
8 | I | 11.3 | 6.0 | 3.3 | True |
1 | B | 11.0 | NaN | 0.4 | False |
7 | H | 11.0 | 4.0 | 0.9 | True |
2 | C | 10.2 | 8.0 | 5.6 | True |
9 | J | 10.0 | 9.0 | 4.1 | False |
3 | D | 9.5 | 5.0 | NaN | True |
0 | A | 9.0 | 2.0 | 3.1 | True |
6 | G | 9.0 | 1.0 | 1.1 | False |
As we can see above, the index labels are no longer in order. Let’s apply loc[] and iloc[] again to get the “first three rows”:
# label-location
df_sorted.loc[[0,1,2]]
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
0 | A | 9.0 | 2.0 | 3.1 | True |
1 | B | 11.0 | NaN | 0.4 | False |
2 | C | 10.2 | 8.0 | 5.6 | True |
# integer-location
df_sorted.iloc[0:3]
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
5 | F | 400.0 | NaN | 2.2 | True |
4 | E | 15.0 | 7.0 | 4.4 | False |
8 | I | 11.3 | 6.0 | 3.3 | True |
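If you want the integer labels to match the new row order after sorting, reset_index(drop=True) rebuilds the index, so loc[] and iloc[] agree again. A sketch on a three-row inline frame:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C'], 'value1': [9.0, 11.0, 10.2]})
df_sorted = df.sort_values(by='value1', ascending=False)

# drop=True discards the old labels instead of keeping them as a column
df_reset = df_sorted.reset_index(drop=True)
print(df_reset)
```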
Boolean indexing#
With boolean indexing (or masking), we can select specific subsets of the data based on conditional/logical queries. Suppose we want to get entries for which “value1” > 40; we can use this condition in the [] operator.
mask = df["value1"] > 40
mask
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
Name: value1, dtype: bool
# apply the mask to the DataFrame: df[mask]
# Or as one-liner:
df[df["value1"] > 40]
label | value1 | value2 | value3 | valid | |
---|---|---|---|---|---|
5 | F | 400.0 | NaN | 2.2 | True |
We can also use this syntax with loc[] to return specific columns only, e.g., “value1” and “valid”.
df.loc[(df["value1"] > 40), ['value1', 'valid']]
value1 | valid | |
---|---|---|
5 | 400.0 | True |
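Conditions can also be combined with the element-wise operators & (and), | (or), and ~ (not). Each condition needs its own parentheses, because & and | bind tighter than comparison operators. A minimal sketch with inline data:

```python
import pandas as pd

df = pd.DataFrame({
    'label': ['A', 'B', 'C', 'D'],
    'value1': [9.0, 11.0, 10.2, 9.5],
    'valid': [True, False, True, True],
})

# parentheses around the comparison are required
subset = df[(df['value1'] > 9.5) & df['valid']]
print(subset)
```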
Get scalar values#
If we only want to access a single scalar value, the fastest way is to use the at[] and iat[] accessors.
df.at[2, "value1"]
np.float64(10.2)
df.iat[0, 4]
np.True_
Math with DataFrames#
Now that we know how to select specific parts of a DataFrame, we want to do some math with it. Mathematical operations can be executed directly on a DataFrame or on selected parts of it. We can also easily add new columns to the DataFrame to store computed results.
Let’s assume we want to compute the mean of each row’s values and store the result in a new column named value_mean. We can simply define that new column and use the basic mathematical operators on the existing columns to fill it:
df['value_mean'] = (df['value1'] + df['value2'] + df['value3']) / 3
df
label | value1 | value2 | value3 | valid | value_mean | |
---|---|---|---|---|---|---|
0 | A | 9.0 | 2.0 | 3.1 | True | 4.700000 |
1 | B | 11.0 | NaN | 0.4 | False | NaN |
2 | C | 10.2 | 8.0 | 5.6 | True | 7.933333 |
3 | D | 9.5 | 5.0 | NaN | True | NaN |
4 | E | 15.0 | 7.0 | 4.4 | False | 8.800000 |
5 | F | 400.0 | NaN | 2.2 | True | NaN |
6 | G | 9.0 | 1.0 | 1.1 | False | 3.700000 |
7 | H | 11.0 | 4.0 | 0.9 | True | 5.300000 |
8 | I | 11.3 | 6.0 | 3.3 | True | 6.866667 |
9 | J | 10.0 | 9.0 | 4.1 | False | 7.700000 |
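The NaN results above arise because arithmetic with NaN yields NaN. If the intent is to average over the values that are present, an alternative is the mean(axis=1) method, which skips NaN by default. A self-contained sketch on two of the rows:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'value1': [9.0, 11.0],
    'value2': [2.0, np.nan],
    'value3': [3.1, 0.4],
})

# axis=1 averages across each row; skipna=True (the default) ignores NaN,
# so the second row is averaged over the two non-missing values
row_mean = df[['value1', 'value2', 'value3']].mean(axis=1)
print(row_mean)
```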
A DataFrame and the underlying Series objects holding the column data also provide many built-in methods for mathematical operations and basic statistics, accessed via the . operator.
# Compute the mean for all numerical columns in the DataFrame
df.mean(numeric_only=True)
value1 49.600000
value2 5.250000
value3 2.788889
valid 0.600000
value_mean 6.428571
dtype: float64
# Get the max value from a column
df['value1'].max()
np.float64(400.0)
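A related convenience method is idxmax(), which returns the index label at which the maximum occurs; the label can then be fed back into loc[]. A small sketch with inline data:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C'], 'value1': [9.0, 400.0, 10.2]})

# idxmax() gives the row label where the maximum value occurs
pos = df['value1'].idxmax()
print(pos, df.loc[pos, 'label'])
```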
Exercise#
From our DataFrame, select all rows where the value in value2 is smaller than the mean of column value3 and where we know that our measurement is valid.