Regularized Linear Regression#

The question arises: Can we get a better fourth degree polynomial fit if we apply suitable constraints to the weights during fitting? This is the idea behind regularized linear regression. In the following, we introduce three such models in more detail.

Ridge Regression#

The first example is called ridge regression. The cost function \(J\) is extended as follows:

\( J(\mathbf{w})=\underbrace{\frac{1}{m}\sum\limits_{i=1}^m\left(\mathbf{w}^T\mathbf{x}^{(i)}-y^{(i)}\right)^2}_{MSE}+\underbrace{\frac{\alpha}{m}\sum\limits_{i=1}^n w_i^2}_{l2~norm} \)

The added term penalizes the squared l2 norm of the feature weights \(w_1\) to \(w_n\). The bias term is not regularized, since the sum over the \(n\) feature weights starts at \(i=1\). Increasing the hyperparameter \(\alpha\) leads to flatter predictions: the variance of the model is reduced, and letting \(\alpha\) go to infinity results in a model that predicts the mean of the training targets independently of the input features. Let us check if we can get a better fourth degree polynomial by applying this technique:

# code from previous notebooks of this section
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
from sklearn.model_selection import GridSearchCV
import numpy as np

hb_data = pd.read_csv('HB_data.csv')
train_set, test_set = train_test_split(hb_data, test_size=0.2, random_state=42)

y_train = train_set['energy']
X_train = train_set.drop(['energy'], axis=1)

num_pipeline = make_pipeline(PolynomialFeatures(degree=4, include_bias=True),StandardScaler())
cat_pipeline = make_pipeline(OneHotEncoder(sparse_output=False))

preprocessing = ColumnTransformer([("num",num_pipeline, make_column_selector(dtype_include=np.number)),
                                        ("cat",cat_pipeline, make_column_selector(dtype_include=object)),])
# end code from previous notebooks of this section 

from sklearn.linear_model import Ridge

model_ridge = make_pipeline(preprocessing, Ridge(random_state=42))

param_grid = [{'ridge__alpha': [0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1],
              }]
grid_search = GridSearchCV(model_ridge, param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)

cv_hyperpara = pd.DataFrame(grid_search.cv_results_)
cv_hyperpara.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_hyperpara.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_ridge__alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.435438 | 0.191815 | 0.012485 | 0.007760 | 0.0010 | {'ridge__alpha': 0.001} | -2.474718 | -3.311104 | -1.991025 | -1.705258 | -2.041419 | -2.304705 | 0.560142 | 1 |
| 1 | 0.283580 | 0.095317 | 0.017382 | 0.020987 | 0.0005 | {'ridge__alpha': 0.0005} | -2.446675 | -3.707638 | -1.855732 | -1.656615 | -1.986956 | -2.330723 | 0.735874 | 2 |
| 3 | 0.173137 | 0.049710 | 0.019708 | 0.009537 | 0.0050 | {'ridge__alpha': 0.005} | -2.741799 | -2.957761 | -2.365332 | -1.944842 | -2.306915 | -2.463330 | 0.353527 | 3 |
| 4 | 0.243274 | 0.081895 | 0.011746 | 0.004618 | 0.0100 | {'ridge__alpha': 0.01} | -2.922780 | -2.943772 | -2.528130 | -2.102223 | -2.319368 | -2.563254 | 0.330852 | 4 |
| 0 | 0.169815 | 0.055283 | 0.011536 | 0.008210 | 0.0001 | {'ridge__alpha': 0.0001} | -2.794982 | -4.791535 | -1.634192 | -1.668841 | -2.564554 | -2.690821 | 1.148973 | 5 |

The best model with \(\alpha=0.001\) has an average RMSE of 2.30 kJ/mol. This is better than our third degree polynomial (RMSE: 3.52 kJ/mol).
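
To illustrate the weight shrinkage discussed above, we can refit the pipeline for a few values of \(\alpha\) and compare the l2 norm of the learned weights. This is a minimal sketch that reuses the preprocessing, X_train, and y_train objects defined above; the \(\alpha\) values are chosen for illustration only:

# illustrate weight shrinkage: larger alpha leads to a smaller weight vector
# (the step name 'ridge' is assigned automatically by make_pipeline)
for alpha in [0.001, 0.1, 10, 1000]:
    model = make_pipeline(preprocessing, Ridge(alpha=alpha, random_state=42))
    model.fit(X_train, y_train)
    w = model.named_steps['ridge'].coef_
    print(f'alpha = {alpha:>8}: l2 norm of weights = {np.linalg.norm(w):.2f}')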

Lasso Regression#

Let us try lasso regression next, which extends the cost function as follows:

\( J(\mathbf{w})=\underbrace{\frac{1}{m}\sum\limits_{i=1}^m\left(\mathbf{w}^T\mathbf{x}^{(i)}-y^{(i)}\right)^2}_{MSE}+\underbrace{2\alpha\sum\limits_{i=1}^n |w_i|}_{l1~norm} \)

An important characteristic of lasso regression is that it tends to eliminate the weights of the least important features, i.e., it sets them exactly to zero. We will verify this on the fitted model below.

from sklearn.linear_model import Lasso

num_pipeline = make_pipeline(PolynomialFeatures(degree=4, include_bias=True),StandardScaler())
cat_pipeline = make_pipeline(OneHotEncoder(sparse_output=False))

preprocessing = ColumnTransformer([("num",num_pipeline, make_column_selector(dtype_include=np.number)),
                                        ("cat",cat_pipeline, make_column_selector(dtype_include=object)),])

model_lasso = make_pipeline(preprocessing, Lasso(tol=0.001,max_iter=10000,random_state=42))

param_grid = [{'lasso__alpha': [0.01,0.05,0.1,0.5,1],
              }]
grid_search = GridSearchCV(model_lasso, param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)

cv_hyperpara = pd.DataFrame(grid_search.cv_results_)
cv_hyperpara.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_hyperpara.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_lasso__alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.136063 | 0.194438 | 0.006329 | 0.000435 | 0.01 | {'lasso__alpha': 0.01} | -4.672010 | -3.804662 | -3.465088 | -3.330916 | -3.395455 | -3.733626 | 0.496842 | 1 |
| 1 | 0.216253 | 0.020953 | 0.006121 | 0.000155 | 0.05 | {'lasso__alpha': 0.05} | -5.462308 | -5.585104 | -4.147108 | -4.482182 | -4.149371 | -4.765214 | 0.632390 | 2 |
| 2 | 0.162689 | 0.034862 | 0.006765 | 0.000687 | 0.10 | {'lasso__alpha': 0.1} | -5.877349 | -6.386720 | -4.705775 | -5.037777 | -4.732187 | -5.347962 | 0.670383 | 3 |
| 3 | 0.067492 | 0.010111 | 0.007745 | 0.001291 | 0.50 | {'lasso__alpha': 0.5} | -7.852872 | -8.679819 | -7.103944 | -7.422412 | -7.022961 | -7.616402 | 0.606497 | 4 |
| 4 | 0.048799 | 0.006630 | 0.006862 | 0.000191 | 1.00 | {'lasso__alpha': 1} | -8.326920 | -9.490296 | -7.606208 | -8.001640 | -7.512350 | -8.187483 | 0.713441 | 5 |

The best set of hyperparameters has an average RMSE of 3.73 kJ/mol, which is not as good as the third degree polynomial (RMSE: 3.52 kJ/mol).
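
As stated above, lasso regression tends to set the weights of the least important features exactly to zero. We can check this on the best model found by the grid search; this is a minimal sketch that reuses the fitted grid_search object from the cell above (the step name 'lasso' is assigned automatically by make_pipeline):

# count how many weights the best lasso model set exactly to zero
best_lasso = grid_search.best_estimator_.named_steps['lasso']
n_zero = np.sum(best_lasso.coef_ == 0)
print(f'{n_zero} of {best_lasso.coef_.size} weights are exactly zero')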

ElasticNet Regression#

Elastic net is the combination of both previous regularization techniques. It has two hyperparameters: \(\alpha\) and the l1-ratio. The l1-ratio determines the relative weight of the l1 and l2 penalties and lies between 0 and 1. In the example below, we additionally employ an alternative to grid search: RandomizedSearchCV trains models only for a given number of randomly selected hyperparameter combinations. This makes it efficient when many combinations would otherwise have to be tested, but it might miss the optimal values.
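
Before fitting, let us write down the combined cost function. With \(r\) denoting the l1-ratio, it can be expressed in the notation of the previous sections as (a common convention, consistent with the ridge and lasso formulas above):

\( J(\mathbf{w})=\underbrace{\frac{1}{m}\sum\limits_{i=1}^m\left(\mathbf{w}^T\mathbf{x}^{(i)}-y^{(i)}\right)^2}_{MSE}+r\cdot 2\alpha\sum\limits_{i=1}^n |w_i|+(1-r)\cdot\frac{\alpha}{m}\sum\limits_{i=1}^n w_i^2 \)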

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

num_pipeline = make_pipeline(PolynomialFeatures(degree=4, include_bias=True),StandardScaler())
cat_pipeline = make_pipeline(OneHotEncoder(sparse_output=False))

preprocessing = ColumnTransformer([("num",num_pipeline, make_column_selector(dtype_include=np.number)),
                                        ("cat",cat_pipeline, make_column_selector(dtype_include=object)),])

model_elasticnet = make_pipeline(preprocessing, ElasticNet(max_iter=10000,tol=0.01,random_state=42))

param_distribs = {'elasticnet__alpha': [0.01,0.05,0.1,0.5,1],
                  'elasticnet__l1_ratio': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]}

rnd_search = RandomizedSearchCV(model_elasticnet, param_distributions=param_distribs, n_iter=20, cv=5,
                                scoring='neg_root_mean_squared_error', random_state=42)
rnd_search.fit(X_train, y_train)

cv_hyperpara = pd.DataFrame(rnd_search.cv_results_)
cv_hyperpara.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_hyperpara.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_elasticnet__l1_ratio | param_elasticnet__alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 0.143747 | 0.076474 | 0.007933 | 0.002688 | 0.4 | 0.01 | {'elasticnet__l1_ratio': 0.4, 'elasticnet__alp... | -5.132070 | -4.486064 | -3.815053 | -3.763908 | -3.765210 | -4.192461 | 0.543637 | 1 |
| 6 | 0.082414 | 0.009320 | 0.006356 | 0.000242 | 0.5 | 0.01 | {'elasticnet__l1_ratio': 0.5, 'elasticnet__alp... | -5.144582 | -4.374075 | -3.841969 | -3.824129 | -3.791464 | -4.195244 | 0.521304 | 2 |
| 10 | 0.059473 | 0.007288 | 0.006203 | 0.000154 | 0.7 | 0.01 | {'elasticnet__l1_ratio': 0.7, 'elasticnet__alp... | -5.256995 | -4.363637 | -3.873677 | -3.893520 | -3.835811 | -4.244728 | 0.541676 | 3 |
| 8 | 0.046171 | 0.003873 | 0.006102 | 0.000098 | 0.9 | 0.01 | {'elasticnet__l1_ratio': 0.9, 'elasticnet__alp... | -5.325776 | -4.433316 | -3.929402 | -3.976640 | -3.888517 | -4.310730 | 0.544176 | 4 |
| 14 | 0.042612 | 0.002461 | 0.005982 | 0.000127 | 0.9 | 0.05 | {'elasticnet__l1_ratio': 0.9, 'elasticnet__alp... | -5.629445 | -5.646073 | -4.293566 | -4.542854 | -4.297907 | -4.881969 | 0.623686 | 5 |

The best set of hyperparameters has an average RMSE of 4.19 kJ/mol. Thus, combining both regularization techniques did not result in a better model than ridge regression alone (RMSE: 2.30 kJ/mol).
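
For reference, the best hyperparameter combination and its cross-validation RMSE can be read directly from the fitted search object (a minimal sketch using the rnd_search object from above):

# best hyperparameter combination and the corresponding cross-validation RMSE
print(rnd_search.best_params_)
print(f'CV RMSE: {-rnd_search.best_score_:.2f} kJ/mol')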