Boosting (Regression)#
Boosting refers to any ensemble method that combines several regression models into a better one. The general idea is to train predictors sequentially, each one trying to correct its predecessor. The most popular boosting methods are AdaBoost and gradient boosting.
AdaBoost#
AdaBoost first trains an initial regression model. The obtained model is then used to make predictions on the training set, which allows us to determine the error of the model for each instance (e.g. \(|y_i-h_t(x_i)|\)). This error is used to update the old weights \(w_{i}^{t}\) as follows:
\( w_{i}^{t+1} = w_{i}^{t} \cdot \left(\frac{e_t}{1-e_t}\right)^{\eta\,\left(1-L\left(\frac{|y_i-h_t(x_i)|}{\max_j|y_j-h_t(x_j)|}\right)\right)} \)
where \(e_t\) is calculated by:
\( e_t = \sum\limits_{i=1}^m w_i^t \cdot L\left(\frac{|y_i-h_t(x_i)|}{\max_j|y_j-h_t(x_j)|}\right) \)
\(\eta\) is the learning rate and \(L\) is a loss function whose values lie in \([0,1]\) (e.g. a linear, square, or exponential loss of the normalized error). \(m\) is the number of instances in the training data. The initial weights \(w_i^1\) are \(\frac{1}{m}\). The updated and normalized weights are then used to train a new model, and this procedure can be repeated several times. The final prediction is a weighted median of the predictions of all models.
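To make the update rule concrete, here is a minimal sketch of a single weight update with the linear loss; the targets, predictions and learning rate below are toy values chosen purely for illustration.
# illustrative single weight update (linear loss) on toy numbers
import numpy as np
y = np.array([1.0, 2.0, 3.0, 4.0])             # true targets
pred = np.array([1.1, 2.05, 3.02, 3.5])        # predictions h_t(x_i) of the current model
w = np.full(len(y), 1/len(y))                  # initial weights w_i^1 = 1/m
eta = 0.7                                      # learning rate
abs_err = np.abs(y - pred)
L = abs_err / abs_err.max()                    # normalized linear loss in [0, 1]
e_t = np.sum(w * L)                            # weighted average loss (boosting continues only while e_t < 0.5)
w = w * (e_t / (1 - e_t))**(eta * (1 - L))     # down-weight well-predicted instances
w = w / w.sum()                                # normalize before training the next model
print(w)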
Let us use AdaBoost to improve our SVR model.
# code from previous notebooks of this section
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
import numpy as np
hb_data = pd.read_csv('HB_data.csv')
train_set, test_set = train_test_split(hb_data, test_size=0.2, random_state=42)
y_train = train_set['energy']
X_train = train_set.drop(['energy'], axis=1)
num_pipeline = make_pipeline(StandardScaler())
cat_pipeline = make_pipeline(OneHotEncoder(sparse_output=False))
preprocessing = ColumnTransformer([("num", num_pipeline, make_column_selector(dtype_include=np.number)),
                                   ("cat", cat_pipeline, make_column_selector(dtype_include=object))])
# end code from previous notebooks of this section
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
# AdaBoost with an SVR base estimator and 5 boosting rounds
model_adaboost = make_pipeline(preprocessing,
                               AdaBoostRegressor(estimator=SVR(kernel="rbf", C=5000, gamma='scale', epsilon=0.1),
                                                 learning_rate=0.7, n_estimators=5, loss='linear', random_state=42))
scores = -cross_val_score(model_adaboost, X_train, y_train, scoring='neg_root_mean_squared_error', cv=5)
print(f"Root mean square error of each validation in kJ/mol:\n{scores}\n")
print("This is an average root mean square error of %0.2f kJ/mol with a standard deviation of %0.2f kJ/mol\n" % (scores.mean(), scores.std()))
Root mean square error of each validation in kJ/mol:
[1.87352819 2.14775678 1.38423592 1.29054251 1.59111137]
This is an average root mean square error of 1.66 kJ/mol with a standard deviation of 0.32 kJ/mol
Thus, AdaBoost resulted in an improved SVR model. In total, we ran 5 boosting rounds. AdaBoost can capture complex patterns by adapting to difficult cases, and adjusting the sample weights helps to reduce the risk of overfitting. However, AdaBoost is sensitive to outliers and noisy data, and it might struggle with imbalanced data sets.
Gradient Boosting#
Gradient boosting fits each new predictor to the residual errors made by the previous predictors. Scikit-learn's GradientBoostingRegressor uses decision trees as base learners, but the idea is not limited to this model.
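The following minimal sketch illustrates this principle with squared loss: every new tree is fit to the residuals of the current ensemble. The toy data, number of trees and tree depth are assumptions chosen purely for illustration.
# minimal illustration of gradient boosting: each new tree is fit to the residuals
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(42)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + 0.1*rng.randn(200)
learning_rate = 0.1
prediction = np.zeros_like(y_toy)                    # start from a constant (zero) prediction
for _ in range(70):
    residuals = y_toy - prediction                   # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_toy, residuals)
    prediction += learning_rate*tree.predict(X_toy)  # add a scaled correction
print("Training RMSE after boosting: %0.3f" % np.sqrt(np.mean((y_toy - prediction)**2)))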
from sklearn.ensemble import GradientBoostingRegressor
# gradient boosting with decision trees, using the SVR model for the initial predictions
model_gradient = make_pipeline(preprocessing,
                               GradientBoostingRegressor(init=SVR(kernel="rbf", gamma='scale', C=5000, epsilon=0.1),
                                                         random_state=42, learning_rate=0.1, n_estimators=70,
                                                         max_depth=5, min_samples_split=2))
scores = -cross_val_score(model_gradient, X_train, y_train, scoring='neg_root_mean_squared_error', cv=5)
print(f"Root mean square error of each validation in kJ/mol:\n{scores}\n")
print("This is an average root mean square error of %0.2f kJ/mol with a standard deviation of %0.2f kJ/mol\n" % (scores.mean(), scores.std()))
Root mean square error of each validation in kJ/mol:
[1.42749115 1.51116803 1.21027719 1.11317383 1.16739953]
This is an average root mean square error of 1.29 kJ/mol with a standard deviation of 0.16 kJ/mol
This is our best model. The learning rate scales the contribution of each tree. A low value requires more trees to fit the training set, but the predictions commonly generalize better. The hyperparameter “n_estimators” defines the number of boosting stages. The minimum number of samples required to split a node is set by “min_samples_split”. The maximum depth of each tree, i.e. the longest path from the root node to a leaf, is set by “max_depth”.
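If you want to explore these hyperparameters further, a small grid search over the pipeline can be run. The parameter values below are assumptions chosen for illustration, not tuned recommendations; the step prefix “gradientboostingregressor” is the name generated automatically by make_pipeline.
# hypothetical grid of hyperparameter values, for illustration only
from sklearn.model_selection import GridSearchCV
param_grid = {'gradientboostingregressor__learning_rate': [0.05, 0.1, 0.2],
              'gradientboostingregressor__n_estimators': [50, 70, 100],
              'gradientboostingregressor__max_depth': [3, 5]}
grid_search = GridSearchCV(model_gradient, param_grid, scoring='neg_root_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print("Best cross-validated root mean square error: %0.2f kJ/mol" % -grid_search.best_score_)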