Training a binary classifier
We have prepared a data set for this session which consists of two input features and a label with two classes. You can load it with:
import pandas as pd
data = pd.read_csv('cl1_data.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   feature1  2500 non-null   float64
 1   feature2  2500 non-null   float64
 2   label     2500 non-null   object
dtypes: float64(2), object(1)
memory usage: 58.7+ KB
We can check the number of data points per label by:
data["label"].value_counts()
label
group1 2000
group2 500
Name: count, dtype: int64
We have two labels: "group1" has 2000 instances while "group2" has 500. Since classification is a supervised technique, we will split our data set into a training and a test set:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
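Since the classes are imbalanced (2000 vs. 500 instances), you may want the training and test sets to keep the same class proportions. A minimal sketch using the stratify argument of train_test_split (the variable names train_set_strat and test_set_strat are only for illustration; the rest of this session continues with the split above):
# Optional alternative: preserve the ~4:1 class ratio in both splits.
train_set_strat, test_set_strat = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["label"]
)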
Next, we scale our input features, just as we did for the regression models. We will reuse a pipeline similar to the one from the regression section:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
import numpy as np
y_train = train_set['label']
X_train = train_set.drop(['label'], axis=1)
num_pipeline = make_pipeline(StandardScaler())
preprocessing = ColumnTransformer(
    [("num", num_pipeline, make_column_selector(dtype_include=np.number))]
)
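If you want to verify what the preprocessing step does, you can fit and apply it on its own; the scaled columns should then have roughly zero mean and unit variance. This is only a quick sanity check and not part of the model itself:
# Fit the scaler on the training features and transform them (sanity check only).
X_train_scaled = preprocessing.fit_transform(X_train)
print(X_train_scaled.mean(axis=0))  # close to 0 for both features
print(X_train_scaled.std(axis=0))   # close to 1 for both features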
We can now train a classification algorithm. We will use a support vector machine (SVM) with an RBF kernel. For classification, an SVM separates the two classes by the widest possible street between them. If all instances are off the street, this is called hard margin classification. However, this fails for overlapping classes (e.g. due to outliers) that are not linearly separable even after feature transformation. Soft margin classification allows instances to lie on the street separating the two classes. The hyperparameter C controls how many margin violations are tolerated: a small value of C allows more margin violations than a large C. Thus, reducing C too much results in too many instances on the street and underfitting, while increasing C too much might result in overfitting. Let us train an SVM with the default hyperparameters first:
from sklearn.svm import SVC
model_svc = make_pipeline(preprocessing, SVC(kernel='rbf', C=1.0))
model_svc.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x799a2ffbb0e0>)])),
                ('svc', SVC())])
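To get a feeling for the effect of C described above, you could refit the same pipeline with a much smaller and a much larger value and compare the mean accuracy on the training data via score. This is only a rough first indicator; proper performance measures follow in the next section:
# Illustration only: compare training accuracy for different values of C.
for C in (0.01, 1.0, 100.0):
    model = make_pipeline(preprocessing, SVC(kernel='rbf', C=C))
    model.fit(X_train, y_train)
    print(f"C={C}: training accuracy = {model.score(X_train, y_train):.3f}")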
We can use our trained model to predict which class a given data point belongs to:
check_model_svc = pd.DataFrame([[0.6, 0.3],[0.7,0.4]], columns=['feature1', 'feature2'])
model_svc.predict(check_model_svc)
array(['group2', 'group1'], dtype=object)
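Besides the predicted labels, the SVC also exposes decision_function, which returns the signed distance of each point to the separating boundary; points further from the street get larger absolute values:
# Signed distance to the decision boundary; the sign determines the predicted class.
model_svc.decision_function(check_model_svc)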
As in the section on regression, the next step in model development is to fine-tune the hyperparameters, e.g. using grid search. However, this requires familiarity with the performance measures for classification, so we will introduce these next.