Prepare the data#
Before we train our first regression model, the data must be prepared for the machine learning approach. For example, we must transform the character values into numerical attributes as in the previous subsection. Furthermore, some algorithms are sensitive to the scaling of the input features, e.g. support vector regression or k-nearest neighbor regression. It might also be that some values in selected columns are missing. We will show some of these transformations first, before we set up a pipeline that makes the data handling much easier.
First, we pick all input data X from our training data by using drop from pandas to remove the energy column.
# code from previous notebooks of this section
import pandas as pd
from sklearn.model_selection import train_test_split
hb_data = pd.read_csv('HB_data.csv')
train_set, test_set = train_test_split(hb_data, test_size=0.2, random_state=42)
# end code from previous notebooks of this section
X_train = train_set.drop(['energy'], axis=1)
X_train.head()
| | bo-acc | bo-donor | q-acc | q-donor | q-hatom | dist-dh | dist-ah | atomtype-acc | atomtype-don |
|---|---|---|---|---|---|---|---|---|---|
| 63 | 0.2549 | 1.1085 | 0.167554 | -0.178104 | -0.030259 | 0.965293 | 2.034707 | S | O |
| 1316 | 0.1725 | 0.8950 | -0.261959 | 0.276786 | 0.108965 | 1.055615 | 1.844385 | O | N |
| 1018 | 0.2110 | 1.0962 | 0.205667 | -0.182710 | -0.008530 | 0.970337 | 2.129663 | S | O |
| 1046 | 0.1783 | 1.0630 | -0.377025 | -0.211999 | 0.049246 | 0.996096 | 1.903904 | O | O |
| 1149 | 0.0623 | 1.0837 | -0.215193 | -0.064738 | 0.078079 | 0.972071 | 2.327929 | O | O |
The first column shows the index of each instance in the original data set. A common problem is that your data is not complete: selected values might be missing in some columns. You can remove such instances with the dropna method of pandas:
X_train.dropna(subset=["q-acc"], inplace=True)
Since our data set is complete, no instance will be removed from our example.
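You can verify this by counting the missing values per column:
# Number of missing values in each column (all zeros for our complete data set).
print(X_train.isna().sum())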
You can also provide a guess with fillna of pandas. Here is a code example where each missing value is replaced by the median value of the column:
median = X_train["q-acc"].median()
X_train.fillna({"q-acc": median}, inplace=True)
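Alternatively, scikit-learn offers the SimpleImputer class, which performs the same replacement and, unlike fillna, can later be reused inside a pipeline. A minimal sketch, applied to the numerical columns only since the median is undefined for characters:
from sklearn.impute import SimpleImputer
# fit() learns the median of each numerical column, transform() fills the missing values.
imputer = SimpleImputer(strategy="median")
X_train_num_only = X_train.drop(['atomtype-acc', 'atomtype-don'], axis=1)
X_train_imputed = imputer.fit_transform(X_train_num_only)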
You have already seen in a previous section how to transform characters into numerical attributes:
from sklearn.preprocessing import OrdinalEncoder
atomtype_acc_c = train_set[["atomtype-acc"]]
ordinal_encoder = OrdinalEncoder()
atomtype_acc_num = ordinal_encoder.fit_transform(atomtype_acc_c)
print("First five instances before transformation:")
print(atomtype_acc_c[:5])
print("\nFirst five instances after transformation:")
print(atomtype_acc_num[:5])
First five instances before transformation:
atomtype-acc
63 S
1316 O
1018 S
1046 O
1149 O
First five instances after transformation:
[[4.]
[3.]
[4.]
[3.]
[3.]]
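If you need the original characters back, the encoder also provides inverse_transform:
# Recover the original element symbols from the numerical codes.
print(ordinal_encoder.inverse_transform(atomtype_acc_num[:5]))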
You can check the numbers assigned to each element with the following line:
ordinal_encoder.categories_
[array(['Cl', 'F', 'N', 'O', 'S'], dtype=object)]
Please note that counting starts at 0. Passing these values directly to a regression algorithm is not reasonable, since the algorithm will assume that two nearby values are more similar than two distant values. This is not the case!
A common solution is to create one binary attribute per category: the attribute for the element O is equal to 1 when the element is “O” and 0 otherwise. Scikit-learn provides a OneHotEncoder class to convert categorical values into one-hot vectors:
from sklearn.preprocessing import OneHotEncoder
atomtype_acc_c = train_set[["atomtype-acc"]]
cat_encoder = OneHotEncoder()
atomtype_acc_1hot = cat_encoder.fit_transform(atomtype_acc_c)
print("First five instances before transformation:")
print(atomtype_acc_c[:5])
print("\nFirst five instances after transformation:")
print(atomtype_acc_1hot[:5].toarray())
First five instances before transformation:
atomtype-acc
63 S
1316 O
1018 S
1046 O
1149 O
First five instances after transformation:
[[0. 0. 0. 0. 1.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 1. 0.]]
You can get the categories in the same way as for the ordinal encoder:
cat_encoder.categories_
[array(['Cl', 'F', 'N', 'O', 'S'], dtype=object)]
Please note, the one-hot encoder will raise an exception when the fitted encoder is applied to an instance with an unknown category.
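If unknown categories can occur at prediction time, OneHotEncoder accepts handle_unknown="ignore", which encodes an unseen category as an all-zero vector instead of raising an exception. A minimal sketch with the hypothetical unseen element "P":
from sklearn.preprocessing import OneHotEncoder
# With handle_unknown="ignore", an unseen category becomes an all-zero row.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train_set[["atomtype-acc"]])
unseen = pd.DataFrame({"atomtype-acc": ["P"]})  # "P" did not occur in the training data
print(safe_encoder.transform(unseen).toarray())  # prints [[0. 0. 0. 0. 0.]]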
Most machine learning approaches do not perform well on input features of different scales. Thus, it is important to apply feature scaling before passing the input features to the regression model. Scikit-learn provides two common approaches.
In min-max scaling (also called normalization), the values are scaled to a given range (by default between 0 and 1). This is done by subtracting the minimum value and dividing the result by the difference between the maximum and the minimum. Please note, some models work best on specific scales. For example, neural networks work best with zero-mean input, so a range from -1 to 1 is desirable, since the activation functions of neural networks change most strongly close to zero. A code example is given below:
from sklearn.preprocessing import MinMaxScaler
X_train_num = train_set.drop(['energy','atomtype-acc','atomtype-don'], axis=1)
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_num_min_max = min_max_scaler.fit_transform(X_train_num)
print("First five instances before transformation:")
print(X_train_num[:5])
print("\nFirst five instances after transformation:")
print(X_train_num_min_max[:5])
First five instances before transformation:
bo-acc bo-donor q-acc q-donor q-hatom dist-dh dist-ah
63 0.2549 1.1085 0.167554 -0.178104 -0.030259 0.965293 2.034707
1316 0.1725 0.8950 -0.261959 0.276786 0.108965 1.055615 1.844385
1018 0.2110 1.0962 0.205667 -0.182710 -0.008530 0.970337 2.129663
1046 0.1783 1.0630 -0.377025 -0.211999 0.049246 0.996096 1.903904
1149 0.0623 1.0837 -0.215193 -0.064738 0.078079 0.972071 2.327929
First five instances after transformation:
[[-0.0056319 0.4504764 0.72635861 -0.73536865 -0.38518884 -0.86668117
-0.27765376]
[-0.37688669 -0.49567915 0.03100271 0.74724875 0.68813457 -0.52531492
-0.56979647]
[-0.20342419 0.39596721 0.78806129 -0.75038093 -0.21767286 -0.84761657
-0.1318979 ]
[-0.35075467 0.24883669 -0.15528228 -0.84584221 0.22774124 -0.75026229
-0.47843542]
[-0.87339491 0.34057168 0.10671407 -0.36587624 0.45002428 -0.84106274
0.17243787]]
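As a quick check of what the scaler computed, here is a minimal sketch reproducing the transformation manually: each column is scaled to [0, 1] first and then stretched to (-1, 1).
import numpy as np
# Manual min-max scaling: shift by the column minimum, divide by the range, stretch to (-1, 1).
manual = (X_train_num - X_train_num.min()) / (X_train_num.max() - X_train_num.min()) * 2 - 1
print(np.allclose(manual.values, X_train_num_min_max))  # should print True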
Standardization is an alternative approach. First, it subtracts the mean value, so the mean is shifted to zero. Then the result is divided by the standard deviation. Thus, standardized values are not restricted to a specific range, and standardization is much less affected by outliers. You can set this up with the following code:
from sklearn.preprocessing import StandardScaler
X_train_num = train_set.drop(['energy','atomtype-acc','atomtype-don'], axis=1)
std_scaler = StandardScaler()
X_train_num_std = std_scaler.fit_transform(X_train_num)
print("First five instances before transformation:")
print(X_train_num[:5])
print("\nFirst five instances after transformation:")
print(X_train_num_std[:5])
First five instances before transformation:
bo-acc bo-donor q-acc q-donor q-hatom dist-dh dist-ah
63 0.2549 1.1085 0.167554 -0.178104 -0.030259 0.965293 2.034707
1316 0.1725 0.8950 -0.261959 0.276786 0.108965 1.055615 1.844385
1018 0.2110 1.0962 0.205667 -0.182710 -0.008530 0.970337 2.129663
1046 0.1783 1.0630 -0.377025 -0.211999 0.049246 0.996096 1.903904
1149 0.0623 1.0837 -0.215193 -0.064738 0.078079 0.972071 2.327929
First five instances after transformation:
[[ 1.5289407 1.08558958 1.30277591 -1.04442968 -1.7550715 -0.69528915
-0.40588825]
[ 0.4406232 -1.34253292 -0.43263547 1.96015514 1.51792787 0.11637256
-1.0615082 ]
[ 0.94912106 0.94570243 1.4567683 -1.07485268 -1.24424719 -0.64995954
-0.07878632]
[ 0.51722807 0.56812085 -0.89755003 -1.26830886 0.11400147 -0.41848162
-0.85647783]
[-1.01486938 0.80354069 -0.24368135 -0.29563819 0.79183281 -0.63437658
0.60419701]]
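Again, a minimal sketch reproducing the transformation manually; note that StandardScaler divides by the population standard deviation (ddof=0), whereas pandas defaults to the sample standard deviation (ddof=1):
import numpy as np
# Manual standardization: subtract the column mean, divide by the population standard deviation.
manual = (X_train_num - X_train_num.mean()) / X_train_num.std(ddof=0)
print(np.allclose(manual.values, X_train_num_std))  # should print True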
Please note, neither approach is well suited for data with a heavy tail: most values would still be squeezed into a small range after the transformation. Thus, you should shrink the heavy tail, for example with a logarithm or square root, before scaling!
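Here is an illustrative sketch of this idea; our features do not actually have heavy tails, so the choice of the dist-ah column is purely hypothetical:
import numpy as np
from sklearn.preprocessing import StandardScaler
# Hypothetical example: compress a heavy right tail with log(1 + x) before the scaler is fitted.
X_shrunk = X_train_num.copy()
X_shrunk["dist-ah"] = np.log1p(X_shrunk["dist-ah"])
X_shrunk_std = StandardScaler().fit_transform(X_shrunk)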
Scikit-learn provides the possibility to set up a pipeline for the transformations. This makes it much easier to execute everything correctly, for example when running the fitted transformations on the test data set. Below is an example with a standard scaler and a one-hot encoder.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
import numpy as np
#This line will get solely the column energy from the training data set.
y_train = train_set['energy']
#This line will remove solely the column energy from the training data set.
X_train = train_set.drop(['energy'], axis=1)
num_pipeline = make_pipeline(StandardScaler())
# If you want to give a specific name to your pipeline, you can also use the following two lines of code:
# from sklearn.pipeline import Pipeline
# num_pipeline = Pipeline([("scaling", StandardScaler()),])
cat_pipeline = make_pipeline(OneHotEncoder())
# The num_pipeline is applied solely to numerical columns, while cat_pipeline is used on character columns
preprocessing = ColumnTransformer([("num", num_pipeline, make_column_selector(dtype_include=np.number)),
                                   ("cat", cat_pipeline, make_column_selector(dtype_include=object))])
# You can also apply a transformer solely to selected features:
#cat_attribs = ["atomtype-acc"]
#preprocessing = ColumnTransformer([("cat-acc",cat_pipeline, cat_attribs)],remainder='passthrough')
X_train_prepared = preprocessing.fit_transform(X_train)
print("These are the first 5 instances of our input features after the pipeline:")
print(X_train_prepared[:5])
These are the first 5 instances of our input features after the pipeline:
[[ 1.5289407 1.08558958 1.30277591 -1.04442968 -1.7550715 -0.69528915
-0.40588825 0. 0. 0. 0. 1.
0. 0. 1. 0. ]
[ 0.4406232 -1.34253292 -0.43263547 1.96015514 1.51792787 0.11637256
-1.0615082 0. 0. 0. 1. 0.
0. 1. 0. 0. ]
[ 0.94912106 0.94570243 1.4567683 -1.07485268 -1.24424719 -0.64995954
-0.07878632 0. 0. 0. 0. 1.
0. 0. 1. 0. ]
[ 0.51722807 0.56812085 -0.89755003 -1.26830886 0.11400147 -0.41848162
-0.85647783 0. 0. 0. 1. 0.
0. 0. 1. 0. ]
[-1.01486938 0.80354069 -0.24368135 -0.29563819 0.79183281 -0.63437658
0.60419701 0. 0. 0. 1. 0.
0. 0. 1. 0. ]]
Thus, our data set is now ready for training our first regression model. Please note, never use fit() or fit_transform() on anything other than the training data! You can use transform() on any data; a sketch for the test set follows after the table below. You can get the column names of your pipeline by using pandas:
import pandas as pd
pd.DataFrame(X_train_prepared, columns=preprocessing.get_feature_names_out())
| | num__bo-acc | num__bo-donor | num__q-acc | num__q-donor | num__q-hatom | num__dist-dh | num__dist-ah | cat__atomtype-acc_Cl | cat__atomtype-acc_F | cat__atomtype-acc_N | cat__atomtype-acc_O | cat__atomtype-acc_S | cat__atomtype-don_F | cat__atomtype-don_N | cat__atomtype-don_O | cat__atomtype-don_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.528941 | 1.085590 | 1.302776 | -1.044430 | -1.755071 | -0.695289 | -0.405888 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.440623 | -1.342533 | -0.432635 | 1.960155 | 1.517928 | 0.116373 | -1.061508 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.949121 | 0.945702 | 1.456768 | -1.074853 | -1.244247 | -0.649960 | -0.078786 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.517228 | 0.568121 | -0.897550 | -1.268309 | 0.114001 | -0.418482 | -0.856478 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -1.014869 | 0.803541 | -0.243681 | -0.295638 | 0.791833 | -0.634377 | 0.604197 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | 0.699495 | -1.220842 | 1.126638 | 1.097076 | 0.425000 | -0.107656 | 1.091242 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1306 | 0.900252 | -0.223436 | -2.857524 | -1.191333 | -0.740334 | -0.023279 | 0.724262 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1307 | -0.777130 | 0.341799 | 0.155201 | -0.975407 | -0.225631 | -0.235411 | 0.106780 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1308 | 0.020617 | 0.327015 | -0.032606 | 0.012793 | -1.577556 | 2.758347 | -1.385322 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1309 | 2.184044 | -1.547246 | 1.382905 | 1.018680 | -0.542061 | -0.026322 | -0.317851 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1310 rows × 16 columns
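As promised above, here is a minimal sketch of how the fitted pipeline is reused on the test set: the statistics and categories learned from the training data are applied with transform() only.
# Never call fit() or fit_transform() here; the pipeline was fitted on the training data.
X_test = test_set.drop(['energy'], axis=1)
X_test_prepared = preprocessing.transform(X_test)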
Exercise#
Make a pipeline with a MinMaxScaler in the range from 0 to 1 for the input features of housing.csv, which is in the same directory as this notebook.
# your code here