Multiclass Classification
A multiclass classifier can distinguish between more than two classes. Some classifiers are true multiclass classifiers, for example random forests. Other classifiers are limited to binary classification, for example support vector machines. However, you can still use a binary classifier for multiclass classification by combining multiple binary classifiers.
The one-versus-the-rest strategy trains a binary classifier for each class. If you have \(N\) classes, \(N\) binary classifiers must be trained. This gives access to a decision score for each class, and the class whose classifier yields the highest decision score is predicted as the label. This strategy is the one most often employed for binary classifiers.
The one-versus-one strategy trains a binary classifier for each pair of classes. Thus, \(\frac{N(N-1)}{2}\) classifiers are needed for \(N\) classes. The predicted class is the one that wins the most direct comparisons. The main advantage of this approach is that each binary classifier needs only the training instances of the two classes it has to distinguish. This strategy is therefore reasonable for classifiers that scale poorly with the number of instances \(m\).
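As a quick check (plain Python, nothing scikit-learn specific), we can count the binary classifiers each strategy needs for a five-class problem like the one used below:
# Number of binary classifiers required for N = 5 classes
N = 5
print("One-versus-the-rest:", N)            # 5 (one per class)
print("One-versus-one:", N * (N - 1) // 2)  # 10 (one per pair)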
Scikit-learn supports you by automatically selecting the most suitable strategy when a binary classifier is applied to a multiclass problem. Let us train a support vector machine for a problem with the classes “group1”, “group2”, “group3”, “group4”, and “group5”:
import pandas as pd

# Load the prepared data set with two features and one label column
data = pd.read_csv('cl2_data.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   feature1  2500 non-null   float64
 1   feature2  2500 non-null   float64
 2   label     2500 non-null   object
dtypes: float64(2), object(1)
memory usage: 58.7+ KB
data["label"].value_counts()
label
group1 500
group2 500
group3 500
group4 500
group5 500
Name: count, dtype: int64
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
import numpy as np
from sklearn.svm import SVC

# Split off 20 % of the data as test set
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
y_train = train_set['label']
X_train = train_set.drop(['label'], axis=1)

# Standardize all numerical features
num_pipeline = make_pipeline(StandardScaler())
preprocessing = ColumnTransformer([("num", num_pipeline, make_column_selector(dtype_include=np.number))])

# Support vector classifier with RBF kernel
model_svc = make_pipeline(preprocessing, SVC(kernel='rbf', C=1.0, random_state=42))
model_svc.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7024f5217fe0>)])),
                ('svc', SVC(random_state=42))])
We can use our model to predict the labels for a given set of input features:
check_model_svc = pd.DataFrame([[0.6, 0.3],[0.7,0.4]], columns=['feature1', 'feature2'])
model_svc.predict(check_model_svc)
array(['group5', 'group2'], dtype=object)
In the case of SVC, scikit-learn uses the one-versus-one strategy. Remember: support vector machines with the kernel trick scale between \(O(m^2\cdot n)\) and \(O(m^3\cdot n)\). Thus, SVC gets really slow for data sets with a large number of instances \(m\) due to this poor computational scaling. The one-versus-one strategy is therefore reasonable, since it reduces the number of instances each single classifier is trained on. For each class, the decision function returns the number of won duels plus or minus a small tweak (at most ±0.33) derived from the confidence scores of the binary classifiers; this tweak breaks ties between classes with the same number of won duels. Setting the parameter “random_state” in our SVC keeps the results reproducible.
check_scores_svc = model_svc.decision_function(check_model_svc)
print(check_scores_svc.round(2))
[[ 1.91 3.23 -0.27 0.73 4.29]
[-0.28 4.29 1.76 0.75 3.27]]
We can use the following code to check which class corresponds to each score in the list:
model_svc.classes_
array(['group1', 'group2', 'group3', 'group4', 'group5'], dtype=object)
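As a cross-check, the predicted label should be the class with the highest decision score. A minimal sketch, reusing the objects defined above:
import numpy as np

# The index of the highest OvO decision score selects the predicted class
best = np.argmax(check_scores_svc, axis=1)
print(model_svc.classes_[best])  # expected: ['group5' 'group2'], as predicted above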
If we want to select the strategy ourselves, we can do so with the OneVsOneClassifier or OneVsRestClassifier wrapper:
from sklearn.multiclass import OneVsRestClassifier
model_svc_ovr = make_pipeline(preprocessing, OneVsRestClassifier(SVC(kernel='rbf', C=1.0, random_state=42)))
model_svc_ovr.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7024f5217fe0>)])),
                ('onevsrestclassifier',
                 OneVsRestClassifier(estimator=SVC(random_state=42)))])
With the one-versus-the-rest strategy, the decision function gives the distance to the decision boundary of each binary classifier:
model_svc_ovr.decision_function(check_model_svc).round(2)
array([[-1.97, -1.32, -2.19, -2.72, 1.02],
[-2.57, 0.92, -3.38, -2.55, -0.78]])
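Forcing the one-versus-one strategy explicitly works analogously with OneVsOneClassifier. A minimal sketch, assuming the same preprocessing pipeline and training data as before:
from sklearn.multiclass import OneVsOneClassifier

# Wrap the SVC explicitly in a one-versus-one meta-classifier
model_svc_ovo = make_pipeline(preprocessing,
                              OneVsOneClassifier(SVC(kernel='rbf', C=1.0, random_state=42)))
model_svc_ovo.fit(X_train, y_train)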
We can also carry out a cross-validation for both strategies and compare the \(F_1\) scores:
from sklearn.model_selection import cross_val_score
f1_svc_ovo = cross_val_score(model_svc, X_train, y_train, cv=5, scoring="f1_weighted")
print(f"F1 score of each subset of the cross validation for OvO:\n{f1_svc_ovo}\n")
print("This is an average F1 score of %0.3f for OvO.\n" % (f1_svc_ovo.mean()))
f1_svc_ovr = cross_val_score(model_svc_ovr, X_train, y_train, cv=5, scoring="f1_weighted")
print(f"F1 score of each subset of the cross validation for OvR:\n{f1_svc_ovr}\n")
print("This is an average F1 score of %0.3f for OvR.\n" % (f1_svc_ovr.mean()))
F1 score of each subset of the cross validation for OvO:
[0.92790759 0.94253014 0.93509162 0.95261094 0.91751834]
This is an average F1 score of 0.935 for OvO.
F1 score of each subset of the cross validation for OvR:
[0.93288525 0.94777851 0.93999501 0.94761709 0.91740247]
This is an average F1 score of 0.937 for OvR.
Please note that “f1_weighted” takes a weighted average, where each class is weighted by its number of instances. If you have imbalanced classes and want to give each class the same weight, use “f1_macro” instead.
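A minimal sketch of the macro-averaged variant, reusing the model from above (since our five classes are perfectly balanced, the result should be very close to the weighted score):
# Macro average: every class contributes equally to the F1 score
f1_svc_macro = cross_val_score(model_svc, X_train, y_train, cv=5, scoring="f1_macro")
print("This is an average macro F1 score of %0.3f for OvO." % (f1_svc_macro.mean()))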