Create a test data set

Since we will train a supervised model, the next step is to split the data into a training set and a test set. The test set is held out from training so that we can use it to estimate the generalization error of our final model.

# code from previous notebooks of this section
import pandas as pd
hb_data = pd.read_csv('HB_data.csv')
# end code from previous notebooks of this section

from sklearn.model_selection import train_test_split

# Fix the random seed (random_state) so that the split is reproducible.
# 20% of the data is held out as test data; the remaining 80% is training data.
train_set, test_set = train_test_split(hb_data, test_size=0.2, random_state=42)
print(f"Instances of training data set:{train_set.shape}")
print(f"Instances of test data set:{test_set.shape}")
Instances of training data set:(1310, 10)
Instances of test data set:(328, 10)

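We will work only with the training data set until we have identified our best model. The test data set stays untouched until then, so that the error measured on it is not biased by our modelling choices.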
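As a quick sanity check on a split like the one above, the sketch below verifies that the two subsets do not share any rows, that together they cover the whole data set, and that the same `random_state` reproduces the same split. It does not read `HB_data.csv`; `demo_data` is a hypothetical stand-in with random values and made-up column names, built only to have the same shape (1638 rows, 10 columns) as implied by the output above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for hb_data (assumption): random values,
# 1638 rows and 10 columns, with made-up column names.
rng = np.random.default_rng(0)
demo_data = pd.DataFrame(
    rng.normal(size=(1638, 10)),
    columns=[f"col_{i}" for i in range(10)],
)

demo_train, demo_test = train_test_split(demo_data, test_size=0.2, random_state=42)

# No row should appear in both subsets.
overlap = demo_train.index.intersection(demo_test.index)

# Together, the subsets should cover every row of the original data.
covered = demo_train.index.union(demo_test.index)

# Splitting again with the same random_state should give the same test rows.
_, demo_test_again = train_test_split(demo_data, test_size=0.2, random_state=42)
same_split = demo_test.index.equals(demo_test_again.index)

print(f"Training shape: {demo_train.shape}, test shape: {demo_test.shape}")
print(f"Rows in both subsets: {len(overlap)}")
print(f"Rows covered: {len(covered)} of {len(demo_data)}")
print(f"Same split with same seed: {same_split}")
```

The same three checks can be run on `train_set` and `test_set` directly, provided the index of `hb_data` is unique.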