Selecting data from DataFrames#

Like with Python sequences, e.g., lists, we may want to get specific parts of the data in a DataFrame. And like with Python sequences, we can use indexing, slicing, and masking on DataFrames, too.

We start by reading a CSV file into a DataFrame again:

import pandas as pd
df = pd.read_csv('data/measurements_1.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   label   10 non-null     object 
 1   value1  10 non-null     float64
 2   value2  8 non-null      float64
 3   value3  9 non-null      float64
 4   valid   10 non-null     bool   
dtypes: bool(1), float64(3), object(1)
memory usage: 458.0+ bytes

Pandas provides methods to display just the first or last n rows of a DataFrame: head(n) and tail(n).

df.head(3)
  label  value1  value2  value3  valid
0     A     9.0     2.0     3.1   True
1     B    11.0     NaN     0.4  False
2     C    10.2     8.0     5.6   True
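Analogously, tail(n) shows the last n rows. A quick sketch, using a small inline DataFrame standing in for the CSV data so the example is self-contained:

```python
import pandas as pd

# Small inline DataFrame standing in for the CSV data
df = pd.DataFrame({'label': ['A', 'B', 'C', 'D'],
                   'value1': [9.0, 11.0, 10.2, 9.5]})

# tail(2) returns the last two rows, keeping their index labels 2 and 3
print(df.tail(2))
```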

We can also get the column names via the columns attribute, which returns an Index object holding the column names and some metadata.

df.columns
Index(['label', 'value1', 'value2', 'value3', 'valid'], dtype='object')

Selecting rows and columns#

Let’s try to get the DataFrame’s first row:

# df[0]  # uncommented, this line raises a KeyError

Oops… this raises a KeyError. For DataFrames, the [] operator expects labels only and operates on the columns, not the rows. We can use it to get a subset of the DataFrame with the specified column labels:

df['value1']
0      9.0
1     11.0
2     10.2
3      9.5
4     15.0
5    400.0
6      9.0
7     11.0
8     11.3
9     10.0
Name: value1, dtype: float64

Notice that the resulting data structure has the column data and an index, too. That’s because Pandas DataFrames use Pandas Series for the columns, which are one-dimensional arrays with an index for the rows and a label for the column data. It can be seen as a single-column table.

type(df['value1'])
pandas.core.series.Series
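The Series keeps the DataFrame’s row index, and the column label becomes the Series’ name attribute. A minimal sketch with inline data, so it runs without the CSV file:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B'], 'value1': [9.0, 11.0]})
s = df['value1']

# The column label becomes the Series' name; the row index is shared
print(s.name)         # value1
print(list(s.index))  # [0, 1]
print(s.dtype)        # float64
```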

We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable.

df_subset = df[['label', 'value1']]
df_subset
  label  value1
0     A     9.0
1     B    11.0
2     C    10.2
3     D     9.5
4     E    15.0
5     F   400.0
6     G     9.0
7     H    11.0
8     I    11.3
9     J    10.0

Label-location based selection#

If we want to select specific rows, we need to use the loc[] accessor. Note that loc[] also works with labels only. Let’s get the row with index label 0:

df.loc[0]
label        A
value1     9.0
value2     2.0
value3     3.1
valid     True
Name: 0, dtype: object

We can also select more than one row by providing the row labels as a list. In addition, we can select specific columns, too:

df.loc[[0,2], ['label', 'value1']]
  label  value1
0     A     9.0
2     C    10.2

We can also use slicing via [from:to] to get data starting from a specific label up to a specific label. Note that, unlike Python’s integer slicing, label slicing with loc[] includes the end label.

df.loc[0:2, 'label':'value2']
  label  value1  value2
0     A     9.0     2.0
1     B    11.0     NaN
2     C    10.2     8.0

Attention: slicing by label requires that the order of the labels in the row and column index is known and taken into account. Since 'value2' comes after 'label' in the column index, the reversed slice below selects no columns at all; only the row index remains:

df.loc[0:2, 'value2':'label']
Empty DataFrame
Columns: []
Index: [0, 1, 2]

Integer-location based selection#

If we want to select data based on the indices’ integer positions, we need to use iloc[]. Let’s get the value from the first row and the first column:

df.iloc[0, 0]
'A'

We can use slicing to get specific parts of the data. With iloc[], slicing follows Python’s usual convention: the end position is excluded. Let’s get the first 5 rows and the first 3 columns:

df.iloc[0:5, 0:3]
  label  value1  value2
0     A     9.0     2.0
1     B    11.0     NaN
2     C    10.2     8.0
3     D     9.5     5.0
4     E    15.0     7.0

Let’s get everything from the 5th row onward and from the 2nd column onward:

df.iloc[4:, 1:]
   value1  value2  value3  valid
4    15.0     7.0     4.4  False
5   400.0     NaN     2.2   True
6     9.0     1.0     1.1  False
7    11.0     4.0     0.9   True
8    11.3     6.0     3.3   True
9    10.0     9.0     4.1  False

To highlight the difference between label-based and integer-based selection, let’s try both approaches on a sorted dataset:

df_sorted = df.sort_values(by='value1', ascending=False)
df_sorted
  label  value1  value2  value3  valid
5     F   400.0     NaN     2.2   True
4     E    15.0     7.0     4.4  False
8     I    11.3     6.0     3.3   True
1     B    11.0     NaN     0.4  False
7     H    11.0     4.0     0.9   True
2     C    10.2     8.0     5.6   True
9     J    10.0     9.0     4.1  False
3     D     9.5     5.0     NaN   True
0     A     9.0     2.0     3.1   True
6     G     9.0     1.0     1.1  False

As we can see above, the index labels are no longer in order. Let’s apply loc[] and iloc[] again to get the “first three rows”:

# label-location
df_sorted.loc[[0,1,2]]
  label  value1  value2  value3  valid
0     A     9.0     2.0     3.1   True
1     B    11.0     NaN     0.4  False
2     C    10.2     8.0     5.6   True
# integer-location
df_sorted.iloc[0:3]
  label  value1  value2  value3  valid
5     F   400.0     NaN     2.2   True
4     E    15.0     7.0     4.4  False
8     I    11.3     6.0     3.3   True
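If positions and labels should line up again after sorting, reset_index() rebuilds the row labels as 0..n-1; with drop=True the old labels are discarded instead of being kept as a column. A sketch with a small inline DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C'],
                   'value1': [9.0, 11.0, 10.2]})

df_sorted = df.sort_values(by='value1', ascending=False)
print(list(df_sorted.index))  # [1, 2, 0] - the labels are out of order

# Rebuild a fresh 0..n-1 index; drop=True discards the old labels
df_reset = df_sorted.reset_index(drop=True)
print(list(df_reset.index))   # [0, 1, 2]
# Now loc[0] and iloc[0] refer to the same row again
```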

Boolean indexing#

With boolean indexing (or masking), we can select subsets of the data based on logical conditions. Suppose we want all entries for which “value1” > 40; we can pass this condition to the [] operator.

mask = df["value1"] > 40
mask
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
8    False
9    False
Name: value1, dtype: bool
# Apply the mask to the DataFrame: df[mask]
# Or as a one-liner:
df[df["value1"] > 40]
  label  value1  value2  value3  valid
5     F   400.0     NaN     2.2   True

We can also use this syntax with loc[] to return specific columns only, e.g., “value1” and “valid”.

df.loc[(df["value1"] > 40), ['value1', 'valid']]
   value1  valid
5   400.0   True
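Multiple conditions can be combined with the element-wise operators & (and), | (or) and ~ (not). Each comparison needs its own parentheses, because these operators bind more tightly than comparisons like >. A sketch with inline data:

```python
import pandas as pd

df = pd.DataFrame({'value1': [9.0, 400.0, 11.0],
                   'valid': [True, True, False]})

# Parentheses are required: & binds more tightly than >
subset = df[(df['value1'] > 10) & df['valid']]
print(subset)  # only the row with value1 == 400.0 matches both conditions
```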

Get scalar values#

If we only want to access a single scalar value, the fastest way is to use the at[] and iat[] accessors: at[] is label-based (like loc[]), while iat[] is integer-position-based (like iloc[]).

df.at[2, "value1"]
np.float64(10.2)
df.iat[0, 4]
np.True_

Math with DataFrames#

Now that we know how to select specific parts of a DataFrame, we want to do some math with it. Mathematical operations can be executed directly on a DataFrame or selected parts of it. We can also easily add new columns to the DataFrame to store computed results.

Let’s assume we want to compute the mean of each row’s values and store the result in a new column named value_mean. We can simply assign to that new column, using basic mathematical operators on the existing columns:

df['value_mean'] = (df['value1'] + df['value2'] + df['value3']) / 3
df
  label  value1  value2  value3  valid  value_mean
0     A     9.0     2.0     3.1   True    4.700000
1     B    11.0     NaN     0.4  False         NaN
2     C    10.2     8.0     5.6   True    7.933333
3     D     9.5     5.0     NaN   True         NaN
4     E    15.0     7.0     4.4  False    8.800000
5     F   400.0     NaN     2.2   True         NaN
6     G     9.0     1.0     1.1  False    3.700000
7     H    11.0     4.0     0.9   True    5.300000
8     I    11.3     6.0     3.3   True    6.866667
9     J    10.0     9.0     4.1  False    7.700000
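Note that NaN propagates through the + operator, which is why some rows end up with a NaN mean. An alternative is the row-wise mean(axis=1), which skips NaN values by default (skipna=True), so a row with one missing value gets the mean of the remaining values. A sketch with two inline rows, assuming that behaviour is what you want:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'value1': [9.0, 11.0],
                   'value2': [2.0, np.nan],
                   'value3': [3.1, 0.4]})

# mean(axis=1) averages across columns; NaN values are skipped by default
df['value_mean'] = df[['value1', 'value2', 'value3']].mean(axis=1)
print(df['value_mean'])
# Row 0: (9.0 + 2.0 + 3.1) / 3 = 4.7; row 1: (11.0 + 0.4) / 2 = 5.7
```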

A DataFrame and the underlying Series objects holding the column data also provide many built-in methods for mathematical operations and basic statistics, accessible via the . operator.

# Compute the mean for all numerical columns in the DataFrame
df.mean(numeric_only=True)
value1        49.600000
value2         5.250000
value3         2.788889
valid          0.600000
value_mean     6.428571
dtype: float64
# Get the max value from a column
df['value1'].max()
np.float64(400.0)
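For a quick statistical overview, describe() bundles several of these statistics in one call; a sketch with inline data:

```python
import pandas as pd

df = pd.DataFrame({'value1': [9.0, 11.0, 10.2, 9.5]})

# describe() reports count, mean, std, min, quartiles and max
stats = df['value1'].describe()
print(stats)
```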

Exercise#

From our DataFrame, select all rows where the value in value2 is smaller than the mean of column value3 and where we know that our measurement is valid.
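If you get stuck, the following sketch shows one way to express such a query, using a small inline DataFrame rather than the actual data: compare a column against a scalar, then combine the result with the boolean column using &.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'value2': [2.0, np.nan, 8.0, 1.0],
                   'value3': [3.1, 0.4, 5.6, 1.1],
                   'valid': [True, False, True, False]})

# Compare value2 against a scalar (the mean of value3) and require valid == True
threshold = df['value3'].mean()  # mean() ignores NaN by default
result = df[(df['value2'] < threshold) & df['valid']]
print(result)
```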