Linear Regression#
We will start with a linear regression since this approach was often used in the literature to predict the hydrogen bond energy. It has following form:
\( \hat{y} = \mathbf{w}\cdot\mathbf{x} = \mathbf{w^T}\mathbf{x} \)
\(\mathbf{x}\) is the feature vector of an instance and contains \(x_0\) to \(x_m\) where \(x_0=1\). \(w\) is the parameter vector containing the bias term \(w_0\) and the feature weights \(w_1\) to \(w_m\).
We can define a cost function \(J\) which should be minimized during training the model. Scikit-learn used the mean square error (MSE) (also called L2 loss) for the linear regression:
\( J(\mathbf{w})=\underbrace{\frac{1}{m}\sum\limits_{i=1}^m\left(\mathbf{w}^T\mathbf{x}^{(i)}-y^{(i)}\right)^2}_{MSE} \)
An alternative would be the mean absolute error (also called L1 loss). The implementation in scikit-learn to minimize the MSE for a linear regression scales linear with the number of instances \(m\) and cubic with the number of input features \(n\). You can use the stochastic gradient descent regressor in scikit-learn (SGDRegressor) to fit a linear regression model for data sets with a large number of input features. The SGDRegressor scales linear with the number of instances and features.
We can easily setup a pipeline for a linear regression with Scikit-Learn:
# code from previous notebooks of this section
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
import numpy as np
hb_data = pd.read_csv('HB_data.csv')
train_set, test_set = train_test_split(hb_data, test_size=0.2, random_state=42)
y_train = train_set['energy']
X_train = train_set.drop(['energy'], axis=1)
num_pipeline = make_pipeline(StandardScaler())
cat_pipeline = make_pipeline(OneHotEncoder())
preprocessing = ColumnTransformer([("num",num_pipeline, make_column_selector(dtype_include=np.number)),
("cat",cat_pipeline, make_column_selector(dtype_include=object))])
# end code from previous notebooks of this section
from sklearn.linear_model import LinearRegression
# fit_intercepts determines if the bias term is set to zero (False) or will be fitted (True).
model_linr = make_pipeline(preprocessing, LinearRegression(fit_intercept=True))
model_linr.fit(X_train, y_train)
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('num', Pipeline(steps=[('standardscaler', StandardScaler())]), <sklearn.compose._column_transformer.make_column_selector object at 0x711a84674620>), ('cat', Pipeline(steps=[('onehotencoder', OneHotEncoder())]), <sklearn.compose._column_transformer.make_column_selector object at 0x711a84674260>)])), ('linearregression', LinearRegression())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
steps | [('columntransformer', ...), ('linearregression', ...)] | |
transform_input | None | |
memory | None | |
verbose | False |
Parameters
transformers | [('num', ...), ('cat', ...)] | |
remainder | 'drop' | |
sparse_threshold | 0.3 | |
n_jobs | None | |
transformer_weights | None | |
verbose | False | |
verbose_feature_names_out | True | |
force_int_remainder_cols | 'deprecated' |
<sklearn.compose._column_transformer.make_column_selector object at 0x711a84674620>
Parameters
copy | True | |
with_mean | True | |
with_std | True |
<sklearn.compose._column_transformer.make_column_selector object at 0x711a84674260>
Parameters
categories | 'auto' | |
drop | None | |
sparse_output | True | |
dtype | <class 'numpy.float64'> | |
handle_unknown | 'error' | |
min_frequency | None | |
max_categories | None | |
feature_name_combiner | 'concat' |
Parameters
fit_intercept | True | |
copy_X | True | |
tol | 1e-06 | |
n_jobs | None | |
positive | False |
We can now use our model to predict the hydrogen bond energy. In the example below, we use the instances of our training data:
HB_predict = model_linr.predict(X_train)
HB_predict[:5].round(2)
array([ 1.88, -91.99, -4.82, -62.76, -38.48])
We compare it to the label of our training data:
y_train[:5].round(2)
63 2.25
1316 -108.41
1018 -14.47
1046 -65.69
1149 -27.65
Name: energy, dtype: float64
It does not look so good. Let us do the comparison in a more systematic way. Scikit-Learn provides diverse options. A possibility is the coefficient of determination (\(R^2\)) which is given by:
\( R^2 = 1-\frac{\sum\limits_{i=1}^m\left(y^{(i)}-h(\mathbf{x}^{(i)})\right)^2}{\sum\limits_{i=1}^m\left(y^{(i)}-\overline{y}\right)^2} \)
\(\overline{y}\) is the average value of \(y\). A determination coefficient close to 1 would be desirable. More examples to evaluate the model are given below:
from sklearn.metrics import r2_score
lin_r2 = r2_score(y_train, HB_predict)
print("The model has a coefficient of determination of %0.2f on the training data set\n" % lin_r2)
from sklearn.metrics import mean_absolute_error
lin_mae = mean_absolute_error(y_train, HB_predict)
print("The model has a mean absolute error of %0.2f kJ/mol on the training data set\n" % lin_mae)
from sklearn.metrics import mean_absolute_percentage_error
lin_maep = mean_absolute_percentage_error(y_train, HB_predict)
print("The model has a mean absolute percentage error of %0.3f on the training data set\n" % lin_maep)
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(y_train, HB_predict)
print("The model has a mean squared error of %0.2f kJ/mol on the training data set\n" % lin_mse)
from sklearn.metrics import root_mean_squared_error
lin_rmse = root_mean_squared_error(y_train, HB_predict)
print("The model has a root mean squared error of %0.2f kJ/mol on the training data set\n" % lin_rmse)
The model has a coefficient of determination of 0.90 on the training data set
The model has a mean absolute error of 6.77 kJ/mol on the training data set
The model has a mean absolute percentage error of 0.558 on the training data set
The model has a mean squared error of 87.06 kJ/mol on the training data set
The model has a root mean squared error of 9.33 kJ/mol on the training data set
The evaluation above has a drawback: All values are compared for data which was used during the training. Thus, we have no chance to detect overfitting! It would be much better to use our test data set for validation. However, we would like to compare different approaches subsequently. If we take for validation the test data set now, it will affect our final choice. Thus, we do not have a good proof for the accuracy of our model on unseen data. We want to use the test data set solely on our final model!
To solve this problem, we can split our training data set again and use only a part of it for training a model. The other part is used to validate the model on unseen data. In practice, one uses n-fold cross validation. It splits the training data set in n subregions of similar size. Subsequently, one subregion is used vor validation while the rest is used to train the model. This is repeated for all subregions. Below is the code of a 5-fold cross-validation of our linear regression model:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(model_linr, X_train, y_train, scoring='neg_root_mean_squared_error', cv=5)
print(f"Root mean square error of each validation in kJ/mol:\n{scores}\n")
print("This is an average root mean square error of %0.2f kJ/mol with a standard deviation of %0.2f kJ/mol\n" % (scores.mean(), scores.std()))
Root mean square error of each validation in kJ/mol:
[10.23310927 10.13593914 8.86243621 9.07401867 8.99106553]
This is an average root mean square error of 9.46 kJ/mol with a standard deviation of 0.60 kJ/mol
The comparison of the 5 models allows us to estimate how strong the fit depends on the provided data.