Training a binary classifier
We have prepared a data set for this session which consists of two input features and a label with two classes. You can load it with:
import pandas as pd
data = pd.read_csv('cl1_data.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   feature1  2500 non-null   float64
 1   feature2  2500 non-null   float64
 2   label     2500 non-null   object
dtypes: float64(2), object(1)
memory usage: 58.7+ KB
We can check the number of data points per label by:
data["label"].value_counts()
label
group1 2000
group2 500
Name: count, dtype: int64
We have two labels: "group1" has 2000 instances while "group2" has 500. Since classification is a supervised technique, we will split our data set into a training and a test set:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
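Since the classes are imbalanced (2000 vs. 500 instances), you may want the training and test sets to keep the same class proportions. A minimal sketch using the stratify argument of train_test_split (the variable names train_set_strat and test_set_strat are only for illustration; the rest of this session continues with the split above):
# Optional alternative: preserve the ~4:1 class ratio in both splits.
train_set_strat, test_set_strat = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["label"]
)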
Next, we scale our input features, just as we did for the regression models. We will reuse a pipeline similar to the one from the regression section:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
import numpy as np
y_train = train_set['label']
X_train = train_set.drop(['label'], axis=1)
num_pipeline = make_pipeline(StandardScaler())
preprocessing = ColumnTransformer(
    [("num", num_pipeline, make_column_selector(dtype_include=np.number))]
)
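If you want to verify what the preprocessing step does, you can fit and apply it on its own; the scaled columns should then have roughly zero mean and unit variance. This is only a quick sanity check and not part of the model itself:
# Fit the scaler on the training features and transform them (sanity check only).
X_train_scaled = preprocessing.fit_transform(X_train)
print(X_train_scaled.mean(axis=0))  # close to 0 for both features
print(X_train_scaled.std(axis=0))   # close to 1 for both features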
We can now train a classification algorithm. We will use a support vector machine (SVM) with an RBF kernel. For classification, an SVM separates the two classes by the widest possible street between them. If all instances are off the street, this is called hard margin classification. However, this fails for overlapping classes (e.g. due to outliers) that are not linearly separable even after feature transformation. Soft margin classification allows instances to lie on the street separating the two classes. The hyperparameter C controls how many margin violations are tolerated: a small value of C allows more margin violations than a large C. Thus, reducing C too much results in too many instances on the street and underfitting, while increasing C too much might result in overfitting. Let us train an SVM with the default hyperparameters first:
from sklearn.svm import SVC
model_svc = make_pipeline(preprocessing, SVC(kernel='rbf', C=1.0))
model_svc.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x799a2ffbb0e0>)])),
                ('svc', SVC())])
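To get a feeling for the effect of C described above, you could refit the same pipeline with a much smaller and a much larger value and compare the mean accuracy on the training data via score. This is only a rough first indicator; proper performance measures follow in the next section:
# Illustration only: compare training accuracy for different values of C.
for C in (0.01, 1.0, 100.0):
    model = make_pipeline(preprocessing, SVC(kernel='rbf', C=C))
    model.fit(X_train, y_train)
    print(f"C={C}: training accuracy = {model.score(X_train, y_train):.3f}")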
We can use our trained model to predict which class a given data point belongs to:
check_model_svc = pd.DataFrame([[0.6, 0.3],[0.7,0.4]], columns=['feature1', 'feature2'])
model_svc.predict(check_model_svc)
array(['group2', 'group1'], dtype=object)
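Besides the predicted labels, the SVC also exposes decision_function, which returns the signed distance of each point to the separating boundary; points further from the street get larger absolute values:
# Signed distance to the decision boundary; the sign determines the predicted class.
model_svc.decision_function(check_model_svc)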
As in the section on regression, the next step in model development is to fine-tune the hyperparameters, e.g. using grid search. However, this requires familiarity with the performance measures for classification, so we will introduce these next.