Prepare the data#
Before we train our first regression model, the data must be prepared for the machine learning approach. For example, we must transform the character values into numerical attributes as in the previous subsection. Furthermore, some algorithms are sensitive to the scaling of the input features, e.g. support vector regression or k-nearest neighbor regression. It might also be that some values in selected columns are missing. We will show some of these transformations first, before we set up a pipeline that makes the data handling much easier.
First, we pick all input data X from our training data by using drop from pandas to remove the energy column.
# code from previous notebooks of this section
import pandas as pd
from sklearn.model_selection import train_test_split
hb_data = pd.read_csv('HB_data.csv')
train_set, test_set = train_test_split(hb_data, test_size=0.2, random_state=42)
# end code from previous notebooks of this section
X_train = train_set.drop(['energy'], axis=1)
X_train.head()
| | bo-acc | bo-donor | q-acc | q-donor | q-hatom | dist-dh | dist-ah | atomtype-acc | atomtype-don |
|---|---|---|---|---|---|---|---|---|---|
| 63 | 0.2549 | 1.1085 | 0.167554 | -0.178104 | -0.030259 | 0.965293 | 2.034707 | S | O |
| 1316 | 0.1725 | 0.8950 | -0.261959 | 0.276786 | 0.108965 | 1.055615 | 1.844385 | O | N |
| 1018 | 0.2110 | 1.0962 | 0.205667 | -0.182710 | -0.008530 | 0.970337 | 2.129663 | S | O |
| 1046 | 0.1783 | 1.0630 | -0.377025 | -0.211999 | 0.049246 | 0.996096 | 1.903904 | O | O |
| 1149 | 0.0623 | 1.0837 | -0.215193 | -0.064738 | 0.078079 | 0.972071 | 2.327929 | O | O |
The first column shows the index of each instance in the original data set. A common problem is that your data is not complete: selected values might be missing in some columns. You can remove such instances with the dropna method of pandas:
X_train.dropna(subset=["q-acc"], inplace=True)
Since our data set is complete, no instance will be removed from our example.
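You can verify this by counting the missing values per column:
# Number of missing values in each column (all zeros for our complete data set).
print(X_train.isna().sum())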
You can also provide a guess with fillna of pandas. Here is a code example where each missing value is replaced by the median value of the column:
median = X_train["q-acc"].median()
X_train.fillna({"q-acc": median}, inplace=True)
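Alternatively, scikit-learn offers the SimpleImputer class, which performs the same replacement and, unlike fillna, can later be reused inside a pipeline. A minimal sketch, applied to the numerical columns only since the median is undefined for characters:
from sklearn.impute import SimpleImputer
# fit() learns the median of each numerical column, transform() fills the missing values.
imputer = SimpleImputer(strategy="median")
X_train_num_only = X_train.drop(['atomtype-acc', 'atomtype-don'], axis=1)
X_train_imputed = imputer.fit_transform(X_train_num_only)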
You have already seen in a previous section how to transform characters into numerical attributes:
from sklearn.preprocessing import OrdinalEncoder
atomtype_acc_c = train_set[["atomtype-acc"]]
ordinal_encoder = OrdinalEncoder()
atomtype_acc_num = ordinal_encoder.fit_transform(atomtype_acc_c)
print("First five instances before transformation:")
print(atomtype_acc_c[:5])
print("\nFirst five instances after transformation:")
print(atomtype_acc_num[:5])
First five instances before transformation:
atomtype-acc
63 S
1316 O
1018 S
1046 O
1149 O
First five instances after transformation:
[[4.]
[3.]
[4.]
[3.]
[3.]]
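If you need the original characters back, the encoder also provides inverse_transform:
# Recover the original element symbols from the numerical codes.
print(ordinal_encoder.inverse_transform(atomtype_acc_num[:5]))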
You can check the numbers assigned to each element with the following line:
ordinal_encoder.categories_
[array(['Cl', 'F', 'N', 'O', 'S'], dtype=object)]
Please note that counting starts at 0. Passing these values directly to a regression algorithm is not reasonable, since the algorithm will assume that two nearby values are more similar than two distant values. This is not the case!
A common solution is to create one binary attribute per category: the attribute for the element O is equal to 1 when the element is “O” and 0 otherwise. Scikit-learn provides a OneHotEncoder class to convert categorical values into one-hot vectors:
from sklearn.preprocessing import OneHotEncoder
atomtype_acc_c = train_set[["atomtype-acc"]]
cat_encoder = OneHotEncoder()
atomtype_acc_1hot = cat_encoder.fit_transform(atomtype_acc_c)
print("First five instances before transformation:")
print(atomtype_acc_c[:5])
print("\nFirst five instances after transformation:")
print(atomtype_acc_1hot[:5].toarray())
First five instances before transformation:
atomtype-acc
63 S
1316 O
1018 S
1046 O
1149 O
First five instances after transformation:
[[0. 0. 0. 0. 1.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 1. 0.]]
You can get the categories in the same way as for the ordinal encoder:
cat_encoder.categories_
[array(['Cl', 'F', 'N', 'O', 'S'], dtype=object)]
Please note, the one-hot encoder will raise an exception when the fitted encoder is applied to an instance with an unknown category.
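If unknown categories can occur at prediction time, OneHotEncoder accepts handle_unknown="ignore", which encodes an unseen category as an all-zero vector instead of raising an exception. A minimal sketch with the hypothetical unseen element "P":
from sklearn.preprocessing import OneHotEncoder
# With handle_unknown="ignore", an unseen category becomes an all-zero row.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train_set[["atomtype-acc"]])
unseen = pd.DataFrame({"atomtype-acc": ["P"]})  # "P" did not occur in the training data
print(safe_encoder.transform(unseen).toarray())  # prints [[0. 0. 0. 0. 0.]]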
Most machine learning approaches do not perform well on input features of different scales. Thus, it is important to apply feature scaling before passing the input features to the regression model. Scikit-learn provides two common approaches.
In min-max scaling (also called normalization), the values are scaled to a given range (by default between 0 and 1). This is done by subtracting the minimum value and dividing the result by the difference between the maximum and the minimum. Please note, some models work best on specific scales. For example, neural networks work best with zero-mean input, so a range from -1 to 1 is desirable, since the activation functions of neural networks change most strongly close to zero. A code example is given below:
from sklearn.preprocessing import MinMaxScaler
X_train_num = train_set.drop(['energy','atomtype-acc','atomtype-don'], axis=1)
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_num_min_max = min_max_scaler.fit_transform(X_train_num)
print("First five instances before transformation:")
print(X_train_num[:5])
print("\nFirst five instances after transformation:")
print(X_train_num_min_max[:5])
First five instances before transformation:
bo-acc bo-donor q-acc q-donor q-hatom dist-dh dist-ah
63 0.2549 1.1085 0.167554 -0.178104 -0.030259 0.965293 2.034707
1316 0.1725 0.8950 -0.261959 0.276786 0.108965 1.055615 1.844385
1018 0.2110 1.0962 0.205667 -0.182710 -0.008530 0.970337 2.129663
1046 0.1783 1.0630 -0.377025 -0.211999 0.049246 0.996096 1.903904
1149 0.0623 1.0837 -0.215193 -0.064738 0.078079 0.972071 2.327929
First five instances after transformation:
[[-0.0056319 0.4504764 0.72635861 -0.73536865 -0.38518884 -0.86668117
-0.27765376]
[-0.37688669 -0.49567915 0.03100271 0.74724875 0.68813457 -0.52531492
-0.56979647]
[-0.20342419 0.39596721 0.78806129 -0.75038093 -0.21767286 -0.84761657
-0.1318979 ]
[-0.35075467 0.24883669 -0.15528228 -0.84584221 0.22774124 -0.75026229
-0.47843542]
[-0.87339491 0.34057168 0.10671407 -0.36587624 0.45002428 -0.84106274
0.17243787]]
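As a quick check of what the scaler computed, here is a minimal sketch reproducing the transformation manually: each column is scaled to [0, 1] first and then stretched to (-1, 1).
import numpy as np
# Manual min-max scaling: shift by the column minimum, divide by the range, stretch to (-1, 1).
manual = (X_train_num - X_train_num.min()) / (X_train_num.max() - X_train_num.min()) * 2 - 1
print(np.allclose(manual.values, X_train_num_min_max))  # should print True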
Standardization is an alternative approach. First, it subtracts the mean value, so the mean is shifted to zero. Then the result is divided by the standard deviation. Thus, standardized values are not restricted to a specific range, and standardization is much less affected by outliers. You can set this up with the following code:
from sklearn.preprocessing import StandardScaler
X_train_num = train_set.drop(['energy','atomtype-acc','atomtype-don'], axis=1)
std_scaler = StandardScaler()
X_train_num_std = std_scaler.fit_transform(X_train_num)
print("First five instances before transformation:")
print(X_train_num[:5])
print("\nFirst five instances after transformation:")
print(X_train_num_std[:5])
First five instances before transformation:
bo-acc bo-donor q-acc q-donor q-hatom dist-dh dist-ah
63 0.2549 1.1085 0.167554 -0.178104 -0.030259 0.965293 2.034707
1316 0.1725 0.8950 -0.261959 0.276786 0.108965 1.055615 1.844385
1018 0.2110 1.0962 0.205667 -0.182710 -0.008530 0.970337 2.129663
1046 0.1783 1.0630 -0.377025 -0.211999 0.049246 0.996096 1.903904
1149 0.0623 1.0837 -0.215193 -0.064738 0.078079 0.972071 2.327929
First five instances after transformation:
[[ 1.5289407 1.08558958 1.30277591 -1.04442968 -1.7550715 -0.69528915
-0.40588825]
[ 0.4406232 -1.34253292 -0.43263547 1.96015514 1.51792787 0.11637256
-1.0615082 ]
[ 0.94912106 0.94570243 1.4567683 -1.07485268 -1.24424719 -0.64995954
-0.07878632]
[ 0.51722807 0.56812085 -0.89755003 -1.26830886 0.11400147 -0.41848162
-0.85647783]
[-1.01486938 0.80354069 -0.24368135 -0.29563819 0.79183281 -0.63437658
0.60419701]]
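Again, a minimal sketch reproducing the transformation manually; note that StandardScaler divides by the population standard deviation (ddof=0), whereas pandas defaults to the sample standard deviation (ddof=1):
import numpy as np
# Manual standardization: subtract the column mean, divide by the population standard deviation.
manual = (X_train_num - X_train_num.mean()) / X_train_num.std(ddof=0)
print(np.allclose(manual.values, X_train_num_std))  # should print True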
Please note, neither approach is well suited for data with a heavy tail: most values would still be squeezed into a small range after the transformation. Thus, you should shrink the heavy tail, for example with a logarithm or square root, before scaling!
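Here is an illustrative sketch of this idea; our features do not actually have heavy tails, so the choice of the dist-ah column is purely hypothetical:
import numpy as np
from sklearn.preprocessing import StandardScaler
# Hypothetical example: compress a heavy right tail with log(1 + x) before the scaler is fitted.
X_shrunk = X_train_num.copy()
X_shrunk["dist-ah"] = np.log1p(X_shrunk["dist-ah"])
X_shrunk_std = StandardScaler().fit_transform(X_shrunk)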
Scikit-learn provides the possibility to set up a pipeline for the transformations. This makes it much easier to execute everything correctly, for example when running the fitted transformations on the test data set. Below is an example with a standard scaler and a one-hot encoder.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
import numpy as np
#This line will get solely the column energy from the training data set.
y_train = train_set['energy']
#This line will remove solely the column energy from the training data set.
X_train = train_set.drop(['energy'], axis=1)
num_pipeline = make_pipeline(StandardScaler())
# If you want to give a specific name to your pipeline, you can also use the following two lines of code:
# from sklearn.pipeline import Pipeline
# num_pipeline = Pipeline([("scaling", StandardScaler()),])
cat_pipeline = make_pipeline(OneHotEncoder())
# The num_pipeline is applied solely to numerical columns, while cat_pipeline is used on character columns
preprocessing = ColumnTransformer([("num", num_pipeline, make_column_selector(dtype_include=np.number)),
                                   ("cat", cat_pipeline, make_column_selector(dtype_include=object))])
# You can also apply a transformer solely to selected features:
#cat_attribs = ["atomtype-acc"]
#preprocessing = ColumnTransformer([("cat-acc",cat_pipeline, cat_attribs)],remainder='passthrough')
X_train_prepared = preprocessing.fit_transform(X_train)
print("These are the first 5 instances of our input features after the pipeline:")
print(X_train_prepared[:5])
These are the first 5 instances of our input features after the pipeline:
[[ 1.5289407 1.08558958 1.30277591 -1.04442968 -1.7550715 -0.69528915
-0.40588825 0. 0. 0. 0. 1.
0. 0. 1. 0. ]
[ 0.4406232 -1.34253292 -0.43263547 1.96015514 1.51792787 0.11637256
-1.0615082 0. 0. 0. 1. 0.
0. 1. 0. 0. ]
[ 0.94912106 0.94570243 1.4567683 -1.07485268 -1.24424719 -0.64995954
-0.07878632 0. 0. 0. 0. 1.
0. 0. 1. 0. ]
[ 0.51722807 0.56812085 -0.89755003 -1.26830886 0.11400147 -0.41848162
-0.85647783 0. 0. 0. 1. 0.
0. 0. 1. 0. ]
[-1.01486938 0.80354069 -0.24368135 -0.29563819 0.79183281 -0.63437658
0.60419701 0. 0. 0. 1. 0.
0. 0. 1. 0. ]]
Thus, our data set is now ready for training our first regression model. Please note, never use fit() or fit_transform() on anything other than the training data! You can use transform() on any data; a sketch for the test set follows after the table below. You can get the column names of your pipeline by using pandas:
import pandas as pd
pd.DataFrame(X_train_prepared, columns=preprocessing.get_feature_names_out())
| | num__bo-acc | num__bo-donor | num__q-acc | num__q-donor | num__q-hatom | num__dist-dh | num__dist-ah | cat__atomtype-acc_Cl | cat__atomtype-acc_F | cat__atomtype-acc_N | cat__atomtype-acc_O | cat__atomtype-acc_S | cat__atomtype-don_F | cat__atomtype-don_N | cat__atomtype-don_O | cat__atomtype-don_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.528941 | 1.085590 | 1.302776 | -1.044430 | -1.755071 | -0.695289 | -0.405888 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.440623 | -1.342533 | -0.432635 | 1.960155 | 1.517928 | 0.116373 | -1.061508 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.949121 | 0.945702 | 1.456768 | -1.074853 | -1.244247 | -0.649960 | -0.078786 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.517228 | 0.568121 | -0.897550 | -1.268309 | 0.114001 | -0.418482 | -0.856478 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -1.014869 | 0.803541 | -0.243681 | -0.295638 | 0.791833 | -0.634377 | 0.604197 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | 0.699495 | -1.220842 | 1.126638 | 1.097076 | 0.425000 | -0.107656 | 1.091242 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1306 | 0.900252 | -0.223436 | -2.857524 | -1.191333 | -0.740334 | -0.023279 | 0.724262 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1307 | -0.777130 | 0.341799 | 0.155201 | -0.975407 | -0.225631 | -0.235411 | 0.106780 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1308 | 0.020617 | 0.327015 | -0.032606 | 0.012793 | -1.577556 | 2.758347 | -1.385322 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1309 | 2.184044 | -1.547246 | 1.382905 | 1.018680 | -0.542061 | -0.026322 | -0.317851 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1310 rows × 16 columns
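As promised above, here is a minimal sketch of how the fitted pipeline is reused on the test set: the statistics and categories learned from the training data are applied with transform() only.
# Never call fit() or fit_transform() here; the pipeline was fitted on the training data.
X_test = test_set.drop(['energy'], axis=1)
X_test_prepared = preprocessing.transform(X_test)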
Exercise#
Make a pipeline with a MinMaxScaler in the range from 0 to 1 for the input features of housing.csv, which is in the same directory as this notebook.
# your code here