Regularized Linear Regression#
The question arises: can we get a better fourth-degree polynomial fit if we apply suitable constraints to the weights during fitting? This is the idea behind regularized linear regression, and we will introduce three such models in more detail in the following.
Ridge Regression#
The first example is called ridge regression. The cost function \(J\) is extended as follows:
\( J(\mathbf{w})=\underbrace{\frac{1}{m}\sum\limits_{i=1}^m\left(\mathbf{w}^T\mathbf{x}^{(i)}-y^{(i)}\right)^2}_{MSE}+\underbrace{\frac{\alpha}{m}\sum\limits_{i=1}^n w_i^2}_{l2~norm} \)
The added term is proportional to the squared l2 norm of the feature weights \(w_1\) to \(w_n\). The bias term is not regularized since the sum over the features starts at 1. Increasing the hyperparameter \(\alpha\) leads to flatter predictions: the variance of the model is reduced, and letting \(\alpha\) go to infinity results in a model that predicts the mean of the training targets independent of the input features.
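Before applying ridge regression to our data set, the shrinking effect of \(\alpha\) can be illustrated with a minimal sketch on a small synthetic example (the toy data below is not part of our HB data set; it only serves to show that larger \(\alpha\) values pull the weights towards zero):
# minimal sketch on synthetic data: larger alpha shrinks the ridge weights towards zero
import numpy as np
from sklearn.linear_model import Ridge
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 4))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)
for alpha in [0.01, 1, 100, 10000]:
    ridge_toy = Ridge(alpha=alpha).fit(X_toy, y_toy)
    print(f"alpha={alpha:>7}: ||w||_2 = {np.linalg.norm(ridge_toy.coef_):.3f}")
Let us now check whether we can obtain a better fourth-degree polynomial for our data by applying this technique: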
# code from previous notebooks of this section
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
from sklearn.model_selection import GridSearchCV
import numpy as np
hb_data = pd.read_csv('HB_data.csv')
train_set, test_set = train_test_split(hb_data, test_size=0.2, random_state=42)
y_train = train_set['energy']
X_train = train_set.drop(['energy'], axis=1)
num_pipeline = make_pipeline(PolynomialFeatures(degree=4, include_bias=True),StandardScaler())
cat_pipeline = make_pipeline(OneHotEncoder(sparse_output=False))
preprocessing = ColumnTransformer([("num",num_pipeline, make_column_selector(dtype_include=np.number)),
("cat",cat_pipeline, make_column_selector(dtype_include=object)),])
# end code from previous notebooks of this section
from sklearn.linear_model import Ridge
# pipeline: preprocessing followed by ridge regression
model_ridge = make_pipeline(preprocessing, Ridge(random_state=42))
# regularization strengths to test with 5-fold cross-validation
param_grid = [{'ridge__alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]}]
grid_search = GridSearchCV(model_ridge, param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)
# collect the cross-validation results and rank them by mean test score
cv_hyperpara = pd.DataFrame(grid_search.cv_results_)
cv_hyperpara.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_hyperpara.head()
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_ridge__alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.435438 | 0.191815 | 0.012485 | 0.007760 | 0.0010 | {'ridge__alpha': 0.001} | -2.474718 | -3.311104 | -1.991025 | -1.705258 | -2.041419 | -2.304705 | 0.560142 | 1 |
| 1 | 0.283580 | 0.095317 | 0.017382 | 0.020987 | 0.0005 | {'ridge__alpha': 0.0005} | -2.446675 | -3.707638 | -1.855732 | -1.656615 | -1.986956 | -2.330723 | 0.735874 | 2 |
| 3 | 0.173137 | 0.049710 | 0.019708 | 0.009537 | 0.0050 | {'ridge__alpha': 0.005} | -2.741799 | -2.957761 | -2.365332 | -1.944842 | -2.306915 | -2.463330 | 0.353527 | 3 |
| 4 | 0.243274 | 0.081895 | 0.011746 | 0.004618 | 0.0100 | {'ridge__alpha': 0.01} | -2.922780 | -2.943772 | -2.528130 | -2.102223 | -2.319368 | -2.563254 | 0.330852 | 4 |
| 0 | 0.169815 | 0.055283 | 0.011536 | 0.008210 | 0.0001 | {'ridge__alpha': 0.0001} | -2.794982 | -4.791535 | -1.634192 | -1.668841 | -2.564554 | -2.690821 | 1.148973 | 5 |
The best model with \(\alpha=0.001\) has an average cross-validation RMSE of 2.30 kJ/mol (the test scores in the table are negative because we used the neg_root_mean_squared_error scorer). This is better than our third-degree polynomial (RMSE: 3.52 kJ/mol).
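The same information can also be read directly from the fitted search object; a short sketch:
print(grid_search.best_params_)   # best hyperparameter combination, here {'ridge__alpha': 0.001}
print(-grid_search.best_score_)   # sign flip: the scorer returns negative RMSE; about 2.30 kJ/mol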
Lasso Regression#
Let us try lasso regression next, which extends the cost function as follows:
\( J(\mathbf{w})=\underbrace{\frac{1}{m}\sum\limits_{i=1}^m\left(\mathbf{w}^T\mathbf{x}^{(i)}-y^{(i)}\right)^2}_{MSE}+\underbrace{2\alpha\sum\limits_{i=1}^n |w_i|}_{l1~norm} \)
An important characteristic of lasso regression is that it tends to set the weights of the least important features exactly to zero, i.e. it performs an implicit feature selection (we will check this for our fitted model after the cross-validation results below).
from sklearn.linear_model import Lasso
# rebuild the preprocessing pipeline (identical to the ridge example)
num_pipeline = make_pipeline(PolynomialFeatures(degree=4, include_bias=True), StandardScaler())
cat_pipeline = make_pipeline(OneHotEncoder(sparse_output=False))
preprocessing = ColumnTransformer([("num", num_pipeline, make_column_selector(dtype_include=np.number)),
                                   ("cat", cat_pipeline, make_column_selector(dtype_include=object)),])
# pipeline: preprocessing followed by lasso regression
model_lasso = make_pipeline(preprocessing, Lasso(tol=0.001, max_iter=10000, random_state=42))
# regularization strengths to test with 5-fold cross-validation
param_grid = [{'lasso__alpha': [0.01, 0.05, 0.1, 0.5, 1]}]
grid_search = GridSearchCV(model_lasso, param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)
# collect the cross-validation results and rank them by mean test score
cv_hyperpara = pd.DataFrame(grid_search.cv_results_)
cv_hyperpara.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_hyperpara.head()
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_lasso__alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.136063 | 0.194438 | 0.006329 | 0.000435 | 0.01 | {'lasso__alpha': 0.01} | -4.672010 | -3.804662 | -3.465088 | -3.330916 | -3.395455 | -3.733626 | 0.496842 | 1 |
| 1 | 0.216253 | 0.020953 | 0.006121 | 0.000155 | 0.05 | {'lasso__alpha': 0.05} | -5.462308 | -5.585104 | -4.147108 | -4.482182 | -4.149371 | -4.765214 | 0.632390 | 2 |
| 2 | 0.162689 | 0.034862 | 0.006765 | 0.000687 | 0.10 | {'lasso__alpha': 0.1} | -5.877349 | -6.386720 | -4.705775 | -5.037777 | -4.732187 | -5.347962 | 0.670383 | 3 |
| 3 | 0.067492 | 0.010111 | 0.007745 | 0.001291 | 0.50 | {'lasso__alpha': 0.5} | -7.852872 | -8.679819 | -7.103944 | -7.422412 | -7.022961 | -7.616402 | 0.606497 | 4 |
| 4 | 0.048799 | 0.006630 | 0.006862 | 0.000191 | 1.00 | {'lasso__alpha': 1} | -8.326920 | -9.490296 | -7.606208 | -8.001640 | -7.512350 | -8.187483 | 0.713441 | 5 |
The best set of hyperparameters yields an average RMSE of 3.73 kJ/mol, which is not as good as the third-degree polynomial (RMSE: 3.52 kJ/mol).
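To verify the sparsity-inducing property mentioned above, we can inspect the coefficients of the best lasso model found by the grid search; a short sketch (the step name 'lasso' is assigned automatically by make_pipeline):
# count how many weights of the best lasso model are exactly zero
best_lasso = grid_search.best_estimator_.named_steps['lasso']
print(f"{np.sum(best_lasso.coef_ == 0)} of {best_lasso.coef_.size} weights are exactly zero")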
ElasticNet Regression#
Elastic net is a combination of the two previous regularization techniques. It has two hyperparameters: \(\alpha\) and the l1-ratio. The l1-ratio determines the relative weight of the l1 and l2 penalties and lies between 0 and 1. In the example below, we will additionally employ an alternative to grid search: RandomizedSearchCV trains models only for a given number of randomly selected hyperparameter combinations. This makes it efficient when many hyperparameter values have to be explored, but it might miss the optimal combination.
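Written out in the notation used above (and following scikit-learn's parameterization), the cost function with l1-ratio \(r\) combines both penalty terms:
\( J(\mathbf{w})=\underbrace{\frac{1}{m}\sum\limits_{i=1}^m\left(\mathbf{w}^T\mathbf{x}^{(i)}-y^{(i)}\right)^2}_{MSE}+\underbrace{2\alpha r\sum\limits_{i=1}^n |w_i|}_{l1~norm}+\underbrace{\alpha(1-r)\sum\limits_{i=1}^n w_i^2}_{l2~norm} \)
For the hyperparameter search itself, RandomizedSearchCV also accepts continuous distributions instead of fixed value lists, which allows it to sample values a discrete grid would miss. A minimal sketch of such a parameter specification (the cell below keeps discrete lists so the results stay comparable to the grid searches above):
# sketch: sample alpha from a log-uniform distribution instead of a fixed list
from scipy.stats import loguniform
param_distribs_continuous = {'elasticnet__alpha': loguniform(1e-3, 1e0),
                             'elasticnet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]}
# this dictionary could be passed to RandomizedSearchCV as param_distributions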
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV
# rebuild the preprocessing pipeline (identical to the previous examples)
num_pipeline = make_pipeline(PolynomialFeatures(degree=4, include_bias=True), StandardScaler())
cat_pipeline = make_pipeline(OneHotEncoder(sparse_output=False))
preprocessing = ColumnTransformer([("num", num_pipeline, make_column_selector(dtype_include=np.number)),
                                   ("cat", cat_pipeline, make_column_selector(dtype_include=object)),])
# pipeline: preprocessing followed by elastic net regression
model_elasticnet = make_pipeline(preprocessing, ElasticNet(max_iter=10000, tol=0.01, random_state=42))
# candidate values for the two hyperparameters alpha and l1_ratio
param_distribs = {'elasticnet__alpha': [0.01, 0.05, 0.1, 0.5, 1],
                  'elasticnet__l1_ratio': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}
# evaluate 20 randomly selected combinations with 5-fold cross-validation
rnd_search = RandomizedSearchCV(model_elasticnet, param_distributions=param_distribs, n_iter=20, cv=5,
                                scoring='neg_root_mean_squared_error', random_state=42)
rnd_search.fit(X_train, y_train)
# collect the cross-validation results and rank them by mean test score
cv_hyperpara = pd.DataFrame(rnd_search.cv_results_)
cv_hyperpara.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_hyperpara.head()
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_elasticnet__l1_ratio | param_elasticnet__alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 0.143747 | 0.076474 | 0.007933 | 0.002688 | 0.4 | 0.01 | {'elasticnet__l1_ratio': 0.4, 'elasticnet__alp... | -5.132070 | -4.486064 | -3.815053 | -3.763908 | -3.765210 | -4.192461 | 0.543637 | 1 |
| 6 | 0.082414 | 0.009320 | 0.006356 | 0.000242 | 0.5 | 0.01 | {'elasticnet__l1_ratio': 0.5, 'elasticnet__alp... | -5.144582 | -4.374075 | -3.841969 | -3.824129 | -3.791464 | -4.195244 | 0.521304 | 2 |
| 10 | 0.059473 | 0.007288 | 0.006203 | 0.000154 | 0.7 | 0.01 | {'elasticnet__l1_ratio': 0.7, 'elasticnet__alp... | -5.256995 | -4.363637 | -3.873677 | -3.893520 | -3.835811 | -4.244728 | 0.541676 | 3 |
| 8 | 0.046171 | 0.003873 | 0.006102 | 0.000098 | 0.9 | 0.01 | {'elasticnet__l1_ratio': 0.9, 'elasticnet__alp... | -5.325776 | -4.433316 | -3.929402 | -3.976640 | -3.888517 | -4.310730 | 0.544176 | 4 |
| 14 | 0.042612 | 0.002461 | 0.005982 | 0.000127 | 0.9 | 0.05 | {'elasticnet__l1_ratio': 0.9, 'elasticnet__alp... | -5.629445 | -5.646073 | -4.293566 | -4.542854 | -4.297907 | -4.881969 | 0.623686 | 5 |
The best set of hyperparameters has an average RMSE of 4.19 kJ/mol. Thus, combining both regularization techniques did not result in a better model than ridge regression alone.