Comprehensive Exercise on Segmentation Analysis - Solution#

Spoiler:
This notebook contains a comprehensive step-by-step solution to the exercise.
Try working through the exercise on your own first before looking at this suggested solution.

One extremely helpful application of machine learning is data exploration and segmentation analysis. Clustering algorithms help us find fundamental structures in data that are unknown to us, e.g., whether individual data points share similarities that we do not yet recognize.

Let’s try this out on a dataset about demographics, health metrics, and engagement patterns of patients.

We will use the Patient Segmentation Dataset from Kaggle (CC0: Public Domain).

The dataset contains 2000 patient records with comprehensive information including:

  • Demographics like age, gender, geographic location

  • Health metrics like height, weight, chronic conditions

  • Healthcare utilization like annual visit frequency

  • Insurance & engagement like insurance type, preventive care participation flag

The goal is to apply clustering to the patient data, a so-called segmentation analysis, in order to identify distinct patient personas (e.g., “High-risk seniors”, “Young healthy adults”, “Chronic disease patients”).

Your tasks:

  • Load the data from the file patient_segmentation_dataset.csv

  • Inspect the data and familiarize yourself with its content, data types, distributions, etc.

  • Decide on a suitable clustering algorithm based on your data insights (e.g., can KMeans be used for this specific data?)

  • Prepare the data for clustering and apply the clustering algorithm. Compare different clustering algorithms.

  • Explore the clusters (patient segments) to gain helpful insights

# Import bia-bob as a helpful Python & Medical AI expert
from bia_bob import bob
import os

bob.initialize(
    endpoint='https://kiara.sc.uni-leipzig.de/api/v1', 
    model="vllm-llama-4-scout-17b-16e-instruct",
    system_prompt=os.getenv('SYSTEM_PROMPT_MEDICAL_AI')
)
This notebook may contain text, code and images generated by artificial intelligence. Used model: vllm-llama-4-scout-17b-16e-instruct, vision model: None, endpoint: https://kiara.sc.uni-leipzig.de/api/v1, bia-bob version: 0.34.3. Do not enter sensitive or private information and verify generated contents according to good scientific practice. Read more: https://github.com/haesleinhuepf/bia-bob#disclaimer
%bob Who are you? Just one sentence!

I’m an expert in medical data science and a skilled Python programmer and data analyst with extensive experience working with various medical datasets and applying data analysis, machine learning, and deep learning techniques.

Analysis Plan#

We approach this analysis in structured steps:

  1. Load & inspect the data: understand shape, types, distributions, and data quality

  2. Exploratory Data Analysis (EDA): visualize distributions and relationships

  3. Feature selection & preprocessing: decide which features to use and how to prepare them, encode categoricals, scale numerics

  4. Algorithm selection and cluster profiling: apply and compare different clustering algorithms, profile each cluster to identify patient personas

  5. Visualize and compare clusters: use dimensionality reduction (PCA, t-SNE) for 2D visualization

  6. Wrap-up: summarize findings and discuss limitations

Step 1: Load and Inspect the Data#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
# Notes:
# - set PatientID as dataframe index: index_col=[column index]
# - convert columns with dates to datetime format: parse_dates=[column index]
# - prevent pandas from interpreting the string "None" as a missing value (NaN): keep_default_na=False
df = pd.read_csv('patient_segmentation_dataset.csv', sep=',', index_col=[0], parse_dates=[13], keep_default_na=False)
df.info()
<class 'pandas.DataFrame'>
Index: 2000 entries, P10000 to P11999
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Age                     2000 non-null   int64         
 1   Gender                  2000 non-null   str           
 2   State                   2000 non-null   str           
 3   City                    2000 non-null   str           
 4   Height_cm               2000 non-null   int64         
 5   Weight_kg               2000 non-null   int64         
 6   BMI                     2000 non-null   float64       
 7   Insurance_Type          2000 non-null   str           
 8   Primary_Condition       2000 non-null   str           
 9   Num_Chronic_Conditions  2000 non-null   int64         
 10  Annual_Visits           2000 non-null   int64         
 11  Avg_Billing_Amount      2000 non-null   float64       
 12  Last_Visit_Date         2000 non-null   datetime64[us]
 13  Days_Since_Last_Visit   2000 non-null   int64         
 14  Preventive_Care_Flag    2000 non-null   int64         
dtypes: datetime64[us](1), float64(2), int64(7), str(5)
memory usage: 250.0+ KB
df.head(10)
Age Gender State City Height_cm Weight_kg BMI Insurance_Type Primary_Condition Num_Chronic_Conditions Annual_Visits Avg_Billing_Amount Last_Visit_Date Days_Since_Last_Visit Preventive_Care_Flag
PatientID
P10000 64 Male GA Unknown 151 115 50.4 Private Arthritis 3 7 2995.0 2025-07-18 186 0
P10001 59 Male OH Unknown 189 68 19.0 Medicare Depression 1 8 1209.0 2025-12-12 39 0
P10002 58 Female PA Unknown 156 91 37.4 Private Asthma 1 4 999.0 2025-09-16 126 0
P10003 43 Female GA Unknown 152 92 39.8 Medicare Hypertension 1 6 5638.5 2025-04-09 286 1
P10004 53 Female NC Unknown 167 51 18.3 Medicaid Asthma 1 4 5796.0 2025-03-07 319 0
P10005 73 Male OH Unknown 160 120 46.9 Medicare Anxiety 3 9 2397.5 2025-05-13 252 1
P10006 34 Male PA Unknown 161 51 19.7 Private None 0 3 4668.0 2025-06-11 223 1
P10007 64 Female MI Unknown 177 102 32.6 Private Arthritis 3 1 9942.5 2025-05-13 252 0
P10008 61 Female PA Unknown 171 86 29.4 Private Arthritis 2 7 1036.0 2025-05-14 251 1
P10009 34 Male FL Tampa 182 111 33.5 Medicare None 0 1 4602.0 2026-01-09 11 0
# Summary statistics for numeric columns
df.describe(include=['int64', 'float64']).T
count mean std min 25% 50% 75% max
Age 2000.0 50.69550 15.444450 18.0 40.0 51.00 63.250 87.0
Height_cm 2000.0 167.90750 12.684494 145.0 158.0 168.00 177.000 195.0
Weight_kg 2000.0 85.14350 20.385428 50.0 67.0 86.00 103.000 120.0
BMI 2000.0 30.74065 8.839952 13.4 23.8 30.10 37.025 57.1
Num_Chronic_Conditions 2000.0 1.08000 0.890504 0.0 1.0 1.00 1.000 3.0
Annual_Visits 2000.0 5.46650 3.485965 1.0 3.0 4.00 8.000 12.0
Avg_Billing_Amount 2000.0 4000.27050 2463.239215 207.0 2061.0 3707.25 5650.875 12467.5
Days_Since_Last_Visit 2000.0 180.08500 104.688484 1.0 90.0 183.00 268.000 365.0
Preventive_Care_Flag 2000.0 0.46400 0.498827 0.0 0.0 0.00 1.000 1.0
# Summary statistics for categorical columns
df.describe(include='str').T
count unique top freq
Gender 2000 2 Female 1001
State 2000 10 NC 213
City 2000 20 Unknown 1012
Insurance_Type 2000 4 Medicare 906
Primary_Condition 2000 10 None 495

Observations from Data Inspection#

  • No missing values

  • Mixed data types: We have numerical features (Age, BMI, Avg_Billing_Amount, …), categorical features (Gender, Insurance_Type, Primary_Condition), and binary features (Preventive_Care_Flag).

  • Columns to potentially exclude from clustering: PatientID (unique identifier), State / City (high cardinality, many ‘Unknown’), Last_Visit_Date (high cardinality, redundant with Days_Since_Last_Visit).

  • The string "None" in Primary_Condition represents patients without a chronic condition. This is a valid category, not a missing value.
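The effect of keep_default_na can be sketched on a tiny inline CSV (hypothetical sample rows, not the real file):

```python
import io
import pandas as pd

# Tiny inline CSV mimicking the Primary_Condition column (hypothetical rows)
csv = "PatientID,Primary_Condition\nP1,Asthma\nP2,None\n"

# Default behavior: pandas parses the string "None" as a missing value (NaN)
df_default = pd.read_csv(io.StringIO(csv), index_col=0)
print(df_default['Primary_Condition'].isna().sum())  # 1

# With keep_default_na=False, "None" survives as a regular category
df_kept = pd.read_csv(io.StringIO(csv), index_col=0, keep_default_na=False)
print(df_kept.loc['P2', 'Primary_Condition'])  # the string 'None'
```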

# Collect numerical feature names
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns
print('Numerical features:\n', list(numerical_features))

# Collect categorical feature names
categorical_features = [feat for feat in df.select_dtypes(include=['str']).columns if feat not in ['State', 'City']]
print('Categorical features:\n', list(categorical_features))
Numerical features:
 ['Age', 'Height_cm', 'Weight_kg', 'BMI', 'Num_Chronic_Conditions', 'Annual_Visits', 'Avg_Billing_Amount', 'Days_Since_Last_Visit', 'Preventive_Care_Flag']
Categorical features:
 ['Gender', 'Insurance_Type', 'Primary_Condition']

Step 2: Exploratory Data Analysis (EDA)#

# Distribution of numerical features using histogram plots
fig, axes = plt.subplots(3, 3, figsize=(12, 9))

for ax, col in zip(axes.flatten(), numerical_features):
    sns.histplot(data=df, x=col, bins=30, ax=ax)

plt.suptitle('Distribution of Numeric Features', fontsize=12)
plt.tight_layout()
plt.show()
# Distribution of categorical features using count plots
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, col in zip(axes.flatten(), categorical_features):
    sns.countplot(data=df, x=col, ax=ax)
    ax.tick_params(axis='x', rotation=90)

plt.suptitle('Distribution of Categorical Features', fontsize=12)
plt.tight_layout()
plt.show()
# Correlation matrix for numerical features
fig, ax = plt.subplots(figsize=(8, 8))

corr = df[numerical_features].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

sns.heatmap(corr, mask=mask, annot=True, cmap='RdBu_r', center=0, square=True, linewidths=0.5, ax=ax)

ax.set_title('Correlation Matrix of Numeric Features', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
# Plot relationships: Age vs. Annual Cost, colored by Primary Condition
fig, axes = plt.subplots(figsize=(8, 5))

sns.scatterplot(data=df, x='Age', y='Avg_Billing_Amount', hue='Primary_Condition', ax=axes)
axes.set_title('Age vs Avg_Billing_Amount vs Primary_Condition')

plt.tight_layout()
plt.show()
# Plot relationships: Age vs. Annual Cost, colored by Insurance Type
fig, axes = plt.subplots(figsize=(8, 5))

sns.scatterplot(data=df, x='Age', y='Avg_Billing_Amount', hue='Insurance_Type', ax=axes)
axes.set_title('Age vs Avg_Billing_Amount vs Insurance_Type')

plt.tight_layout()
plt.show()

Key EDA Insights#

  • Age and Gender are roughly uniformly distributed

  • Age, Num_Chronic_Conditions, Annual_Visits, and Avg_Billing_Amount are positively correlated – older patients visit more, have primary conditions, and cost more.

  • BMI is derived from Height and Weight, so including all three would introduce redundancy. We should consider dropping Height and Weight, or BMI.

  • Patients under 40 do not report any primary condition

  • Younger patients under the age of 30 are more likely to pay out of pocket, while patients over the age of 65 are insured exclusively through Medicare.
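Insights like these can be double-checked directly on the dataframe. A minimal sketch on a hypothetical mini-frame (df_demo stands in for the loaded df):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the loaded patient data
df_demo = pd.DataFrame({
    'Age': [25, 34, 43, 64, 73],
    'Primary_Condition': ['None', 'None', 'Hypertension', 'Arthritis', 'Anxiety'],
})

# Cross-check: which conditions do patients under 40 report?
under_40 = df_demo[df_demo['Age'] < 40]
print(under_40['Primary_Condition'].unique())  # ['None']
```

On the real data, the same expression, df[df['Age'] < 40]['Primary_Condition'].unique(), can be used to verify the claim.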

Step 3: Feature Selection and Preprocessing#

Feature decisions#

Include:

  • Age – key demographic

  • BMI – derived health metric (we drop Height_cm and Weight_kg to avoid redundancy)

  • Num_Chronic_Conditions – health complexity indicator

  • Annual_Visits – healthcare utilization

  • Avg_Billing_Amount – healthcare expenditure

  • Days_Since_Last_Visit – recency of engagement

  • Preventive_Care_Flag – engagement indicator (binary categorical with 0 / 1)

  • Gender – key demographic (categorical, should be one-hot encoded)

  • Insurance_Type – access indicator (categorical, should be one-hot encoded)

  • Primary_Condition – health status (categorical, should be one-hot encoded)

Exclude:

  • PatientID – identifier, not a feature

  • State, City – high cardinality geographic data with many unknowns

  • Last_Visit_Date – redundant with Days_Since_Last_Visit

  • Height_cm, Weight_kg – redundant with BMI

# Select features for clustering
# Remove 'Height_cm' and 'Weight_kg' from numeric features
numerical_features_clustering = [feat for feat in numerical_features if feat not in ['Height_cm', 'Weight_kg']]
categorical_features_clustering = categorical_features

# Create a subset with selected features
df_features = df[numerical_features_clustering + categorical_features_clustering].copy()
print(f"Selected features for clustering: {len(df_features.columns)} columns")
df_features.head()
Selected features for clustering: 10 columns
Age BMI Num_Chronic_Conditions Annual_Visits Avg_Billing_Amount Days_Since_Last_Visit Preventive_Care_Flag Gender Insurance_Type Primary_Condition
PatientID
P10000 64 50.4 3 7 2995.0 186 0 Male Private Arthritis
P10001 59 19.0 1 8 1209.0 39 0 Male Medicare Depression
P10002 58 37.4 1 4 999.0 126 0 Female Private Asthma
P10003 43 39.8 1 6 5638.5 286 1 Female Medicare Hypertension
P10004 53 18.3 1 4 5796.0 319 0 Female Medicaid Asthma
# One-hot encoding for categorical features
df_encoded = pd.get_dummies(df_features, columns=categorical_features_clustering, drop_first=False)

print(f"Shape after encoding: {df_encoded.shape}")
print(f"\nEncoded feature names:")
for i, col in enumerate(df_encoded.columns):
    print(f"  {i+1:2d}. {col}")

df_encoded.head()
Shape after encoding: (2000, 23)

Encoded feature names:
   1. Age
   2. BMI
   3. Num_Chronic_Conditions
   4. Annual_Visits
   5. Avg_Billing_Amount
   6. Days_Since_Last_Visit
   7. Preventive_Care_Flag
   8. Gender_Female
   9. Gender_Male
  10. Insurance_Type_Medicaid
  11. Insurance_Type_Medicare
  12. Insurance_Type_Private
  13. Insurance_Type_Self-Pay
  14. Primary_Condition_Anxiety
  15. Primary_Condition_Arthritis
  16. Primary_Condition_Asthma
  17. Primary_Condition_COPD
  18. Primary_Condition_Depression
  19. Primary_Condition_Diabetes
  20. Primary_Condition_Heart Disease
  21. Primary_Condition_Hypertension
  22. Primary_Condition_None
  23. Primary_Condition_Obesity
Age BMI Num_Chronic_Conditions Annual_Visits Avg_Billing_Amount Days_Since_Last_Visit Preventive_Care_Flag Gender_Female Gender_Male Insurance_Type_Medicaid ... Primary_Condition_Anxiety Primary_Condition_Arthritis Primary_Condition_Asthma Primary_Condition_COPD Primary_Condition_Depression Primary_Condition_Diabetes Primary_Condition_Heart Disease Primary_Condition_Hypertension Primary_Condition_None Primary_Condition_Obesity
PatientID
P10000 64 50.4 3 7 2995.0 186 0 False True False ... False True False False False False False False False False
P10001 59 19.0 1 8 1209.0 39 0 False True False ... False False False False True False False False False False
P10002 58 37.4 1 4 999.0 126 0 True False False ... False False True False False False False False False False
P10003 43 39.8 1 6 5638.5 286 1 True False False ... False False False False False False False True False False
P10004 53 18.3 1 4 5796.0 319 0 True False True ... False False True False False False False False False False

5 rows × 23 columns

Why Scaling?#

Some clustering algorithms use distances, e.g. KMeans uses Euclidean distance. Without scaling, features with large ranges (e.g., Avg_Billing_Amount) would dominate the distance calculation over features with small ranges (e.g., Preventive_Care_Flag being 0 or 1). StandardScaler transforms each feature to have mean=0 and std=1, giving all features equal influence.
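The distance-domination effect can be sketched with two toy points (illustrative values, not real patients):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy patients: columns are Preventive_Care_Flag and Avg_Billing_Amount
X = np.array([[0.0, 3000.0],
              [1.0, 4000.0]])

# Unscaled: the billing difference (1000) dwarfs the flag difference (1)
d_raw = np.linalg.norm(X[0] - X[1])
print(round(d_raw, 1))  # 1000.0

# Standardized: both features contribute equally to the distance
X_s = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_s[0] - X_s[1])
print(round(d_scaled, 2))  # 2.83 (sqrt(8): each feature contributes 2 units)
```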

# Standardize all features 
from sklearn.preprocessing import StandardScaler

print("Statistics of first 5 features before scaling:\n", 
      df_encoded.iloc[:, :5].describe().T[['mean', 'std']].round(4))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_encoded)

print("\nStatistics of first 5 features after scaling:\n", 
      pd.DataFrame(X_scaled, columns=df_encoded.columns).iloc[:, :5].describe().T[['mean', 'std']].round(4))
Statistics of first 5 features before scaling:
                              mean        std
Age                       50.6955    15.4445
BMI                       30.7406     8.8400
Num_Chronic_Conditions     1.0800     0.8905
Annual_Visits              5.4665     3.4860
Avg_Billing_Amount      4000.2705  2463.2392

Statistics of first 5 features after scaling:
                         mean     std
Age                     -0.0  1.0003
BMI                     -0.0  1.0003
Num_Chronic_Conditions  -0.0  1.0003
Annual_Visits           -0.0  1.0003
Avg_Billing_Amount      -0.0  1.0003

Step 4: Algorithm Selection#

Overview of Clustering Algorithms#

We apply and compare three clustering algorithms from scikit-learn on the same preprocessed and scaled patient data:

| Algorithm | Type | Requires K? | Handles Noise? | Shape Assumption | Key Hyperparameter(s) |
|-----------|------|-------------|----------------|------------------|-----------------------|
| KMeans | Centroid-based | Yes | No - all points assigned | Spherical, equal-size | n_clusters |
| DBSCAN | Density-based | Auto-discovered | Yes - noise labeled -1 | Arbitrary | eps, min_samples |
| MeanShift | Kernel density-based | Auto-discovered | No - all points assigned | Arbitrary | bandwidth |

How each algorithm works:

  • KMeans: Assigns each point to the nearest centroid and iterates until convergence. Fast and interpretable, but requires specifying K and assumes convex, similarly sized clusters in a Euclidean feature space.

  • DBSCAN: Identifies clusters as dense regions separated by sparser areas. Points in low-density zones are labeled noise. Powerful for irregularly shaped clusters, but sensitive to eps and min_samples.

  • MeanShift: Shifts a kernel window towards local density maxima. Discovers K automatically based on data density. The bandwidth parameter controls kernel size.
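All three algorithms share scikit-learn's fit_predict interface. A minimal sketch on synthetic blobs (toy data, not the patient dataset) shows how each discovers, or must be told, the number of clusters:

```python
from sklearn.cluster import DBSCAN, KMeans, MeanShift
from sklearn.datasets import make_blobs

# Three well-separated toy blobs (illustrative data, not the patient dataset)
X, _ = make_blobs(n_samples=300, centers=[(-5, -5), (0, 0), (5, 5)],
                  cluster_std=0.5, random_state=42)

# KMeans needs K up front; DBSCAN and MeanShift infer it from density
labels_km = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
labels_db = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
labels_ms = MeanShift(bandwidth=2.0).fit_predict(X)

for name, labels in [('KMeans', labels_km), ('DBSCAN', labels_db), ('MeanShift', labels_ms)]:
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude DBSCAN noise
    print(f'{name}: {n_clusters} clusters')
```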

The clustering steps that follow are:

  • Step 4.1: KMeans

    • Elbow method + silhouette analysis to select K

    • Fit KMeans with K

    • Numeric profile table + detailed per-cluster summaries + heatmap

  • Step 4.2: DBSCAN

    • k-distance plot for eps selection

    • Fit DBSCAN

    • Numeric profile table + detailed per-cluster summaries + heatmap

  • Step 4.3: MeanShift

    • Bandwidth estimation with estimate_bandwidth

    • Fit MeanShift

    • Numeric profile table + detailed per-cluster summaries + heatmap

  • Step 5 - Side-by-side comparison

    • PCA projection (1×3 subplots: KMeans | DBSCAN | MeanShift), with KMeans cluster centers marked

    • t-SNE projection (same layout)

    • Quantitative metrics table (Silhouette, Davies-Bouldin, Inertia)

After fitting all three algorithms, we compare their results using PCA and t-SNE projections in Step 5.

Here, we define some more methods for further cluster analysis:

profile_cols = ['Age', 'BMI', 'Num_Chronic_Conditions', 'Annual_Visits', 'Avg_Billing_Amount', 'Days_Since_Last_Visit']

# Method for printing descriptive statistics for a specified cluster algorithm
def print_cluster_profile(clustering='Cluster_KMeans'):
    cluster_profile = df.groupby(clustering)[profile_cols].agg(['mean', 'std']).round(2)
    display(cluster_profile)

# Method for printing aggregated statistics for a specified cluster algorithm and cluster number
def print_cluster_statistics(clustering='Cluster_KMeans', cluster_nb=0):
    grp = df[df[clustering] == cluster_nb]
    n = len(grp)
    print(f"\n{'='*70}")
    print(f"  CLUSTER {cluster_nb}  (n={n}, {n/len(df)*100:.1f}% of patients)")
    print(f"{'='*70}")
    print(f"  Age:                {grp['Age'].mean():.1f} ± {grp['Age'].std():.1f}")
    print(f"  BMI:                {grp['BMI'].mean():.1f} ± {grp['BMI'].std():.1f}")
    print(f"  Chronic conditions: {grp['Num_Chronic_Conditions'].mean():.2f} (avg)")
    top_cond = grp['Primary_Condition'].value_counts().head(3)
    print(f"  Top conditions:     {', '.join([f'{cd} ({ct})' for cd, ct in top_cond.items()])}")
    print(f"  Annual visits:      {grp['Annual_Visits'].mean():.1f}")
    print(f"  Avg billing:        ${grp['Avg_Billing_Amount'].mean():,.0f}")
    print(f"  Preventive care:    {grp['Preventive_Care_Flag'].mean()*100:.1f}%")
    gender_pct = grp['Gender'].value_counts(normalize=True) * 100
    print(f"  Gender:             {', '.join([f'{g}: {p:.0f}%' for g, p in gender_pct.items()])}")
    ins_pct = grp['Insurance_Type'].value_counts(normalize=True) * 100
    print(f"  Insurance:          {', '.join([f'{t}: {p:.0f}%' for t, p in ins_pct.items()])}")

# Method for visualizing cluster means
def plot_cluster_means_heatmap(clustering='Cluster_KMeans'):
    cluster_means = df.groupby(clustering)[profile_cols].mean()
    norm = (cluster_means - cluster_means.min()) / (cluster_means.max() - cluster_means.min())

    fig, ax = plt.subplots(figsize=(10, 4))
    sns.heatmap(norm.T, annot=cluster_means.T.round(1).values, fmt='', annot_kws={"size": 8},
                cmap='YlOrRd', linewidths=1, ax=ax, cbar_kws={'label': 'Normalized value'})
    ax.set_xlabel(clustering, fontsize=10)
    ax.set_title('Cluster Profiles Comparison\n(numbers = actual means, color = normalized)', fontsize=12,)
    plt.tight_layout()
    plt.show()

Step 4.1: KMeans#

KMeans partitions data into K clusters by minimizing within-cluster variance (inertia). Each point is assigned to its nearest centroid; centroids are updated iteratively until convergence.

Assumptions:

  • All features are numeric and on a comparable scale (met via StandardScaler)

  • Clusters are roughly spherical and similarly sized

  • K must be specified in advance; we determine it with the methods below

Determine the Optimal Number of Clusters#

We use two complementary methods:

  1. Elbow Method - plot inertia (within-cluster sum of squares) vs. K; look for the “elbow” where gains diminish

  2. Silhouette Score - measures how well each point fits its own cluster vs. its nearest neighbor cluster (range −1 to 1, higher is better)

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

K_range = range(2, 15)
inertias = []
silhouette_scores_kmeans = []

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil = silhouette_score(X_scaled, labels)
    silhouette_scores_kmeans.append(sil)
    print(f"K={k:2d}  |  Inertia: {km.inertia_:10.0f}  |  Silhouette: {sil:.4f}")
K= 2  |  Inertia:      40206  |  Silhouette: 0.1179
K= 3  |  Inertia:      37111  |  Silhouette: 0.1254
K= 4  |  Inertia:      35237  |  Silhouette: 0.1463
K= 5  |  Inertia:      33327  |  Silhouette: 0.1509
K= 6  |  Inertia:      31437  |  Silhouette: 0.2034
K= 7  |  Inertia:      29238  |  Silhouette: 0.2301
K= 8  |  Inertia:      27051  |  Silhouette: 0.2536
K= 9  |  Inertia:      24850  |  Silhouette: 0.2829
K=10  |  Inertia:      22943  |  Silhouette: 0.2929
K=11  |  Inertia:      21646  |  Silhouette: 0.2675
K=12  |  Inertia:      19959  |  Silhouette: 0.2702
K=13  |  Inertia:      19342  |  Silhouette: 0.2678
K=14  |  Inertia:      18996  |  Silhouette: 0.2578
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Elbow plot
ax1.plot(list(K_range), inertias, 'bs-', linewidth=1, markersize=4)
ax1.set_xlabel('Number of Clusters (K)', fontsize=12)
ax1.set_ylabel('Inertia (Within-Cluster Sum of Squares)', fontsize=12)
ax1.set_title('Elbow Method', fontsize=13, fontweight='bold')
ax1.set_xticks(list(K_range))
ax1.grid(True, alpha=0.3)

# Silhouette plot
ax2.plot(list(K_range), silhouette_scores_kmeans, 'rs-', linewidth=1, markersize=4)
ax2.set_xlabel('Number of Clusters (K)', fontsize=12)
ax2.set_ylabel('Silhouette Score', fontsize=12)
ax2.set_title('Silhouette Analysis', fontsize=13, fontweight='bold')
ax2.set_xticks(list(K_range))
ax2.grid(True, alpha=0.3)

best_k = list(K_range)[np.argmax(silhouette_scores_kmeans)]
ax2.axvline(x=best_k, color='green', linestyle='--', alpha=0.7, label=f'Best K={best_k}')
ax2.legend(fontsize=11)

plt.suptitle('KMeans: Selecting the Number of Clusters', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"\nBest K by Silhouette Score: {best_k} (score: {max(silhouette_scores_kmeans):.4f})")
Best K by Silhouette Score: 10 (score: 0.2929)

Choosing K#

The silhouette scores are relatively low overall. This is common for real-world, high-dimensional, mixed-type data. The elbow plot unfortunately shows no noticeable flattening.

In practice, domain knowledge should also inform the final choice of K.

We choose K=6 as a practical balance:

  • Captures the first substantial rise in the silhouette score

  • Provides enough granularity to identify distinct patient personas

  • Keeps segments manageable for clinical interpretation

  • Consistent with healthcare domain expectations (3–6 personas are typical)

Fit KMeans with K#

K = 6

kmeans_final = KMeans(n_clusters=K, random_state=42, n_init=10)
kmeans_labels = kmeans_final.fit_predict(X_scaled)
df['Cluster_KMeans'] = kmeans_labels

print(f"KMeans (K={K}) – Cluster sizes:")
print(df['Cluster_KMeans'].value_counts().sort_index())
print(f"\nSilhouette Score: {silhouette_score(X_scaled, kmeans_labels):.4f}")
KMeans (K=6) – Cluster sizes:
Cluster_KMeans
0    664
1    158
2    495
3    370
4    150
5    163
Name: count, dtype: int64

Silhouette Score: 0.2034

Profile KMeans Clusters#

We inspect the cluster profiles to understand the characteristics of each patient segment - looking at demographics, health metrics, and healthcare utilization.

# Print cluster profile
print_cluster_profile('Cluster_KMeans')
Age BMI Num_Chronic_Conditions Annual_Visits Avg_Billing_Amount Days_Since_Last_Visit
mean std mean std mean std mean std mean std mean std
Cluster_KMeans
0 57.64 11.18 30.97 8.97 1.45 0.75 6.57 3.40 4420.68 2466.95 177.27 106.92
1 56.37 10.79 31.85 7.86 1.43 0.73 6.58 3.49 4679.84 2665.75 198.32 102.21
2 30.40 6.39 29.89 8.83 0.00 0.00 2.51 1.09 2598.31 1393.22 173.49 103.66
3 57.32 10.97 31.47 9.08 1.41 0.72 6.55 3.47 4526.89 2618.55 181.07 104.42
4 58.35 10.77 30.14 8.47 1.47 0.77 5.89 3.65 4428.27 2720.12 177.96 99.78
5 56.45 11.74 30.21 8.85 1.42 0.73 6.00 3.38 4297.17 2581.05 193.61 104.36
# Print statistics for cluster n
print_cluster_statistics('Cluster_KMeans', 0)
======================================================================
  CLUSTER 0  (n=664, 33.2% of patients)
======================================================================
  Age:                57.6 ± 11.2
  BMI:                31.0 ± 9.0
  Chronic conditions: 1.45 (avg)
  Top conditions:     Obesity (183), Anxiety (175), Asthma (160)
  Annual visits:      6.6
  Avg billing:        $4,421
  Preventive care:    51.8%
  Gender:             Female: 51%, Male: 49%
  Insurance:          Medicare: 56%, Private: 25%, Medicaid: 18%
plot_cluster_means_heatmap('Cluster_KMeans')

Step 4.2: DBSCAN#

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as dense regions separated by sparser areas.

Key properties:

  • No need to specify K: the number of clusters is discovered automatically

  • Points in low-density regions are labeled as noise (label = −1) and excluded from clusters

  • Can find clusters of arbitrary shape

  • Sensitive to two hyperparameters:

    • eps: neighborhood radius - how far apart two points can be to be considered neighbors

    • min_samples: minimum number of neighbors for a point to be a core point

Select eps with the k-Distance Plot#

For each point, we compute the distance to its k-th nearest neighbor (where k = min_samples).
Sorting and plotting these distances reveals an “elbow”; the distance at the elbow is a good candidate for eps.

from sklearn.neighbors import NearestNeighbors

min_samples_dbscan = 10

nbrs = NearestNeighbors(n_neighbors=min_samples_dbscan).fit(X_scaled)
distances, _ = nbrs.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])

plt.figure(figsize=(8, 4))
sns.lineplot(x=np.arange(len(k_distances)), y=k_distances)
plt.xlabel('Points sorted by distance', fontsize=10)
plt.ylabel(f'{min_samples_dbscan}-NN distance', fontsize=10)
plt.title('k-Distance Plot for DBSCAN eps Selection', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"min_samples = {min_samples_dbscan}")
print(f"Suggested eps range (80th–90th percentile of k-distances): "
      f"{np.percentile(k_distances, 80):.2f}{np.percentile(k_distances, 90):.2f}")
min_samples = 10
Suggested eps range (80th–90th percentile of k-distances): 3.14 – 3.37

Fit DBSCAN#

We set eps based on the elbow in the k-distance plot above. A value near the 80th–90th percentile of k-distances is a common starting point. Adjust eps up to merge clusters or down to split them.

from sklearn.cluster import DBSCAN

eps_dbscan = 3.5

dbscan = DBSCAN(eps=eps_dbscan, min_samples=min_samples_dbscan)
dbscan_labels = dbscan.fit_predict(X_scaled)
df['Cluster_DBSCAN'] = dbscan_labels

n_clusters_dbscan = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = int((dbscan_labels == -1).sum())

print(f"DBSCAN (eps={eps_dbscan}, min_samples={min_samples_dbscan})")
print(f"Clusters found: {n_clusters_dbscan}")
print(f"Noise points:   {n_noise} ({n_noise / len(dbscan_labels) * 100:.1f}%)")
print(f"\nCluster sizes:")
for label in sorted(set(dbscan_labels)):
    count = (dbscan_labels == label).sum()
    name = f"Cluster {label}" if label != -1 else "Noise (-1)"
    print(f"  {name}: {count} ({count / len(dbscan_labels) * 100:.1f}%)")

mask_non_noise = dbscan_labels != -1
if n_clusters_dbscan > 1 and mask_non_noise.sum() > 1:
    sil_dbscan = silhouette_score(X_scaled[mask_non_noise], dbscan_labels[mask_non_noise])
    print(f"\nSilhouette Score (non-noise points): {sil_dbscan:.4f}")
DBSCAN (eps=3.5, min_samples=10)
Clusters found: 11
Noise points:   4 (0.2%)

Cluster sizes:
  Noise (-1): 4 (0.2%)
  Cluster 0: 158 (7.9%)
  Cluster 1: 160 (8.0%)
  Cluster 2: 210 (10.5%)
  Cluster 3: 174 (8.7%)
  Cluster 4: 429 (21.4%)
  Cluster 5: 163 (8.2%)
  Cluster 6: 150 (7.5%)
  Cluster 7: 183 (9.2%)
  Cluster 8: 145 (7.2%)
  Cluster 9: 66 (3.3%)
  Cluster 10: 158 (7.9%)

Silhouette Score (non-noise points): 0.3233

Profile DBSCAN Clusters#

We inspect the cluster profiles to understand the characteristics of each patient segment - looking at demographics, health metrics, and healthcare utilization.

# Print cluster profile
print_cluster_profile('Cluster_DBSCAN')
Age BMI Num_Chronic_Conditions Annual_Visits Avg_Billing_Amount Days_Since_Last_Visit
mean std mean std mean std mean std mean std mean std
Cluster_DBSCAN
-1 63.25 0.96 22.65 7.31 3.00 0.00 4.25 5.19 10985.00 852.37 171.75 150.51
0 57.54 10.63 31.72 8.77 1.39 0.67 6.77 3.29 4219.30 2559.14 177.58 102.48
1 57.24 11.19 32.00 9.55 1.41 0.72 6.91 3.34 4341.62 2399.63 183.83 103.24
2 57.09 11.28 31.35 9.31 1.41 0.74 6.43 3.59 4689.83 2567.73 182.60 105.89
3 58.78 10.95 31.70 8.79 1.47 0.77 6.41 3.46 4537.60 2637.18 176.84 112.69
4 31.54 5.92 29.82 8.81 0.00 0.00 2.46 1.10 2578.62 1381.46 174.38 103.87
5 56.45 11.74 30.21 8.85 1.42 0.73 6.00 3.38 4297.17 2581.05 193.61 104.36
6 58.35 10.77 30.14 8.47 1.47 0.77 5.89 3.65 4428.27 2720.12 177.96 99.78
7 57.62 11.53 29.81 8.48 1.48 0.76 6.65 3.37 4193.49 2385.80 175.57 103.49
8 56.66 11.07 30.58 8.98 1.40 0.73 6.30 3.38 4573.91 2351.09 174.48 108.41
9 23.03 4.06 30.33 9.02 0.00 0.00 2.85 0.98 2726.32 1471.93 167.76 102.86
10 56.37 10.79 31.85 7.86 1.43 0.73 6.58 3.49 4679.84 2665.75 198.32 102.21
# Print statistics for cluster n
print_cluster_statistics('Cluster_DBSCAN', 0)
======================================================================
  CLUSTER 0  (n=158, 7.9% of patients)
======================================================================
  Age:                57.5 ± 10.6
  BMI:                31.7 ± 8.8
  Chronic conditions: 1.39 (avg)
  Top conditions:     Depression (158)
  Annual visits:      6.8
  Avg billing:        $4,219
  Preventive care:    48.1%
  Gender:             Female: 54%, Male: 46%
  Insurance:          Medicare: 46%, Medicaid: 27%, Private: 27%
plot_cluster_means_heatmap('Cluster_DBSCAN')

DBSCAN Results#

Why DBSCAN struggles here: High-dimensional patient data (23 features after encoding) tends to have roughly uniform density across the feature space. There are no sharp density drop-offs between patient groups. DBSCAN excels when clusters visibly differ in local density (e.g., spatial or sensor data). Adjusting eps changes the result, but no clear multi-cluster structure emerges.

Key observations:

  • DBSCAN finds very few meaningful clusters; decreasing eps tends to create many small clusters

  • The noise percentage is sensitive to eps: increasing eps reduces noise but merges clusters

  • This behavior is typical for dense, uniformly distributed tabular health data
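This eps sensitivity can be reproduced on a synthetic stand-in for the scaled feature matrix (fabricated uniform data, not the notebook's X_scaled): on uniformly dense high-dimensional points, the noise fraction is governed almost entirely by eps.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in: uniformly distributed points in 23 dimensions,
# mimicking the 23 encoded features. No density contrasts exist here.
rng = np.random.default_rng(42)
X_uniform = rng.uniform(size=(500, 23))

noise_frac = {}
for eps in [0.5, 1.0, 1.5]:
    labels_db = DBSCAN(eps=eps, min_samples=5).fit_predict(X_uniform)
    n_cl = len(set(labels_db)) - (1 if -1 in labels_db else 0)
    n_no = int((labels_db == -1).sum())
    noise_frac[eps] = n_no / len(labels_db)
    print(f"eps={eps}: {n_cl} cluster(s), {n_no} noise points")
```

With small eps nearly every point is noise; increasing eps can only shrink the noise set, until everything merges into one blob.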

Step 4.3: MeanShift#

MeanShift iteratively shifts each data point towards the local density maximum (mode) by averaging over all points within a kernel window. Points that converge to the same mode form a cluster.

Key properties:

  • No need to specify K; the number of clusters is determined automatically by data density

  • All points are assigned to a cluster (no explicit noise concept)

  • The bandwidth parameter controls the kernel size:

    • Larger bandwidth: fewer, coarser clusters

    • Smaller bandwidth: more, finer clusters
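The shift step can be sketched with a flat kernel (illustrative toy code; sklearn's MeanShift additionally handles seeding, convergence checks, and mode merging):

```python
import numpy as np

# One flat-kernel mean-shift update: move x to the mean of all
# points that lie within `bandwidth` of x.
def shift_point(x, X, bandwidth):
    within = np.linalg.norm(X - x, axis=1) <= bandwidth
    return X[within].mean(axis=0)

rng = np.random.default_rng(0)
# Toy 1D data with two density modes, around 0 and around 10
X_modes = np.concatenate([rng.normal(0, 1, (50, 1)),
                          rng.normal(10, 1, (50, 1))])

x = np.array([2.0])                 # start closer to the mode at 0
for _ in range(20):                 # iterate; x converges to that mode
    x = shift_point(x, X_modes, bandwidth=3.0)
print(f"Converged near: {x[0]:.2f}")
```

Points starting in the other basin of attraction would converge near 10 instead; points sharing a mode form one cluster.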

Estimate Bandwidth#

The estimate_bandwidth function uses a nearest-neighbor heuristic to suggest a reasonable bandwidth. The quantile parameter (0–1) controls the fraction of pairwise distances used: lower values yield smaller bandwidth and more clusters.

from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate bandwidth on a random subset for speed
bandwidth = estimate_bandwidth(X_scaled, quantile=0.1, n_samples=1000, random_state=42)
print(f"Estimated bandwidth: {bandwidth:.4f}")
print(f"\nAdjust quantile to control cluster granularity:")
print(f"- Lower quantile -> smaller bandwidth -> more clusters")
print(f"- Higher quantile -> larger bandwidth -> fewer clusters")
Estimated bandwidth: 5.3331

Adjust quantile to control cluster granularity:
- Lower quantile -> smaller bandwidth -> more clusters
- Higher quantile -> larger bandwidth -> fewer clusters
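The quantile effect is easiest to see on synthetic blobs (a toy stand-in, not the notebook's X_scaled):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Four well-separated 2D blobs: lower quantiles yield a smaller
# bandwidth and therefore more (finer) clusters.
X_blobs, _ = make_blobs(n_samples=300, centers=4,
                        cluster_std=0.8, random_state=42)

counts = {}
for q in [0.05, 0.2, 0.5]:
    bw = estimate_bandwidth(X_blobs, quantile=q, random_state=42)
    labels_demo = MeanShift(bandwidth=bw, bin_seeding=True).fit_predict(X_blobs)
    counts[q] = len(set(labels_demo))
    print(f"quantile={q}: bandwidth={bw:.2f}, clusters={counts[q]}")
```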

Fit MeanShift#

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms_labels = ms.fit_predict(X_scaled)
df['Cluster_MeanShift'] = ms_labels

n_clusters_ms = len(set(ms_labels))
print(f"MeanShift (bandwidth={bandwidth:.4f})")
print(f"Clusters found: {n_clusters_ms}")
print(f"\nCluster sizes:")
for label in sorted(set(ms_labels)):
    count = (ms_labels == label).sum()
    print(f"  Cluster {label}: {count} ({count / len(ms_labels) * 100:.1f}%)")

if n_clusters_ms > 1:
    sil_ms = silhouette_score(X_scaled, ms_labels)
    print(f"\nSilhouette Score: {sil_ms:.4f}")
MeanShift (bandwidth=5.3331)
Clusters found: 2

Cluster sizes:
  Cluster 0: 1934 (96.7%)
  Cluster 1: 66 (3.3%)

Silhouette Score: 0.2212

Profile MeanShift Clusters#

We inspect the cluster profiles to understand the characteristics of each patient segment - looking at demographics, health metrics, and healthcare utilization.

# Print cluster profile
print_cluster_profile('Cluster_MeanShift')
Age BMI Num_Chronic_Conditions Annual_Visits Avg_Billing_Amount Days_Since_Last_Visit
mean std mean std mean std mean std mean std mean std
Cluster_MeanShift
0 51.64 14.80 30.75 8.84 1.12 0.88 5.56 3.51 4043.75 2478.82 180.51 104.75
1 23.03 4.06 30.33 9.02 0.00 0.00 2.85 0.98 2726.32 1471.93 167.76 102.86
# Print statistics for cluster n
print_cluster_statistics('Cluster_MeanShift', 0)
======================================================================
  CLUSTER 0  (n=1934, 96.7% of patients)
======================================================================
  Age:                51.6 ± 14.8
  BMI:                30.8 ± 8.8
  Chronic conditions: 1.12 (avg)
  Top conditions:     None (429), Hypertension (210), Obesity (183)
  Annual visits:      5.6
  Avg billing:        $4,044
  Preventive care:    46.8%
  Gender:             Male: 50%, Female: 50%
  Insurance:          Medicare: 47%, Private: 28%, Medicaid: 25%
plot_cluster_means_heatmap('Cluster_MeanShift')

MeanShift Results#

MeanShift often yields a small number of clusters on this dataset because the patient data occupies a relatively compact, uniform region in high-dimensional space.
The clusters found correspond to the dominant density modes.

Observations:

  • The number of clusters is determined automatically based on data density

  • Unlike DBSCAN, all patients are assigned to a cluster (no noise)

  • The bandwidth parameter is the main tuning lever; estimate_bandwidth provides a data-driven starting point

Step 5: Comparison of Clustering Algorithms#

We now compare all three algorithms side-by-side using two dimensionality reduction techniques:

  • Step 5.1: PCA - linear projection, preserves global variance structure, fast

  • Step 5.2: t-SNE - non-linear projection, preserves local neighborhood structure, slower

Both projections are computed on the same scaled data X_scaled and colored by each algorithm’s cluster labels.

PCA Projection#

from sklearn.decomposition import PCA

# Compute 2D PCA projection
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
print(f"PCA explained variance: {pca.explained_variance_ratio_.sum()*100:.1f}% total  "
      f"(PC1: {pca.explained_variance_ratio_[0]*100:.1f}%,  "
      f"PC2: {pca.explained_variance_ratio_[1]*100:.1f}%)")
PCA explained variance: 25.0% total  (PC1: 16.1%,  PC2: 9.0%)
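Beyond the explained variance, pca.components_ shows which features drive each component. A sketch on fabricated data with hypothetical feature names (in the notebook you would pass X_scaled and the real column names):

```python
import numpy as np
from sklearn.decomposition import PCA

# Fabricated data: feature_0 is made a linear mix of feature_1,
# so those two should dominate the first principal component.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
X_toy[:, 0] += 3 * X_toy[:, 1]
feature_names = [f"feature_{i}" for i in range(5)]

pca_toy = PCA(n_components=2).fit(X_toy)
loadings = pca_toy.components_[0]            # PC1 loadings, one per feature
order = np.argsort(np.abs(loadings))[::-1]   # rank by absolute loading
for i in order:
    print(f"{feature_names[i]:>10}: {loadings[i]:+.3f}")
```

In the notebook, the same ranking on X_scaled tells you which patient attributes the 16.1% PC1 variance actually encodes.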
# Helper: scatter plot of cluster labels on 2D coordinates
def plot_clusters_2d(ax, X_2d, labels, title, xlabel='PC_1', ylabel='PC_2'):
    palette = plt.cm.tab10.colors
    for i, label in enumerate(sorted(set(labels))):
        mask = np.array(labels) == label
        color = 'lightgrey' if label == -1 else palette[i % len(palette)]
        lname = 'Noise' if label == -1 else f'Cluster {label}'
        mk = 'x' if label == -1 else 'o'
        ax.scatter(X_2d[mask, 0], X_2d[mask, 1],
                   c=[color], label=lname, alpha=0.5, s=20,
                   marker=mk, linewidth=0.2)
    ax.set_xlabel(xlabel, fontsize=10)
    ax.set_ylabel(ylabel, fontsize=10)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(fontsize=8, loc='best', markerscale=1.5)
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# KMeans
plot_clusters_2d(ax=axes[0], X_2d=X_pca, labels=kmeans_labels,
                 title=f'KMeans  (K={K})')

# DBSCAN
n_db = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
plot_clusters_2d(ax=axes[1], X_2d=X_pca, labels=dbscan_labels,
                 title=f'DBSCAN  (eps={eps_dbscan}, min_s={min_samples_dbscan})\n'
                 f'{n_db} cluster(s), {n_noise} noise points')

# MeanShift
n_ms = len(set(ms_labels))
plot_clusters_2d(ax=axes[2], X_2d=X_pca, labels=ms_labels,
                 title=f'MeanShift  (bw={bandwidth:.2f})\n{n_ms} cluster(s)')

plt.suptitle('PCA Projection', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

t-SNE Projection#

t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear technique that preserves local neighborhood structure, revealing fine-grained cluster topology that PCA may miss.

Key properties:

  • Non-linear: better at revealing local cluster boundaries

  • Stochastic: results depend on the random seed and perplexity (~5–50)

  • Not suitable for projecting new points — purely a visualization tool

  • Interpretation: nearby points in t-SNE space are locally similar, but distances between clusters are not directly meaningful

from sklearn.manifold import TSNE

print("Computing t-SNE (this may take a moment)...")
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)
print("Done.")
Computing t-SNE (this may take a moment)...
Done.
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

n_db = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_ms = len(set(ms_labels))

# KMeans
plot_clusters_2d(ax=axes[0], X_2d=X_tsne, labels=kmeans_labels,
                 title=f'KMeans  (K={K})', xlabel='t-SNE 1', ylabel='t-SNE 2')

# DBSCAN
plot_clusters_2d(ax=axes[1], X_2d=X_tsne, labels=dbscan_labels,
                 title=f'DBSCAN  (eps={eps_dbscan}, min_s={min_samples_dbscan})\n'
                 f'{n_db} cluster(s), {n_noise} noise points',
                 xlabel='t-SNE 1', ylabel='t-SNE 2')


plot_clusters_2d(ax=axes[2], X_2d=X_tsne, labels=ms_labels,
                 title=f'MeanShift  (bw={bandwidth:.2f})\n{n_ms} cluster(s)',
                 xlabel='t-SNE 1', ylabel='t-SNE 2')

plt.suptitle('t-SNE Projection', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()
from sklearn.metrics import davies_bouldin_score

results = {}

# KMeans
results['KMeans'] = {
    'n_clusters':      K,
    'n_noise':         0,
    'silhouette':      round(silhouette_score(X_scaled, kmeans_labels), 4),
    'davies_bouldin':  round(davies_bouldin_score(X_scaled, kmeans_labels), 4),
    'inertia':         round(kmeans_final.inertia_, 1),
}

# DBSCAN
mask_nn = dbscan_labels != -1
if n_clusters_dbscan > 1 and mask_nn.sum() > 1:
    sil_db = round(silhouette_score(X_scaled[mask_nn], dbscan_labels[mask_nn]), 4)
    dbi_db = round(davies_bouldin_score(X_scaled[mask_nn], dbscan_labels[mask_nn]), 4)
else:
    sil_db = float('nan')
    dbi_db = float('nan')

results['DBSCAN'] = {
    'n_clusters':     n_clusters_dbscan,
    'n_noise':        n_noise,
    'silhouette':     sil_db,
    'davies_bouldin': dbi_db,
    'inertia':        float('nan'),
}

# MeanShift
results['MeanShift'] = {
    'n_clusters':     n_clusters_ms,
    'n_noise':        0,
    'silhouette':     round(silhouette_score(X_scaled, ms_labels), 4) if n_clusters_ms > 1 else float('nan'),
    'davies_bouldin': round(davies_bouldin_score(X_scaled, ms_labels), 4) if n_clusters_ms > 1 else float('nan'),
    'inertia':        float('nan'),
}

df_metrics = pd.DataFrame(results).T
df_metrics.index.name = 'Algorithm'
print("Quantitative Metric Comparison\n")
print(df_metrics.to_string())
print("\nMetric interpretation:")
print("  Silhouette:     higher is better  (range: −1 to 1)")
print("  Davies-Bouldin: lower is better   (range: 0 to ∞)")
print("  Inertia:        lower is better   (KMeans only)")
Quantitative Metric Comparison

           n_clusters  n_noise  silhouette  davies_bouldin  inertia
Algorithm                                                          
KMeans            6.0      0.0      0.2034          2.2101  31437.0
DBSCAN           11.0      4.0      0.3233          1.3009      NaN
MeanShift         2.0      0.0      0.2212          1.0641      NaN

Metric interpretation:
  Silhouette:     higher is better  (range: −1 to 1)
  Davies-Bouldin: lower is better   (range: 0 to ∞)
  Inertia:        lower is better   (KMeans only)
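As a sanity check on how these metrics behave, a toy comparison of well-separated versus overlapping blobs (independent of the notebook's variables) shows them moving in the expected, opposite directions:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Well-separated blobs: high silhouette, low Davies-Bouldin
X_good, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
labels_good = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_good)

# Heavily overlapping blobs: low silhouette, high Davies-Bouldin
X_bad, _ = make_blobs(n_samples=300, centers=3, cluster_std=5.0, random_state=42)
labels_bad = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_bad)

print(f"separated:   silhouette={silhouette_score(X_good, labels_good):.3f}, "
      f"DB={davies_bouldin_score(X_good, labels_good):.3f}")
print(f"overlapping: silhouette={silhouette_score(X_bad, labels_bad):.3f}, "
      f"DB={davies_bouldin_score(X_bad, labels_bad):.3f}")
```

Against this yardstick, all three algorithms above land closer to the "overlapping" end, which matches the uniform-density interpretation.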

Step 6: Wrap-Up and Summary#

Algorithm Comparison Summary#

|                   | KMeans                       | DBSCAN                           | MeanShift                    |
|-------------------|------------------------------|----------------------------------|------------------------------|
| Clusters found    | 6 (specified)                | Medium (auto)                    | Very few (auto)              |
| Noise handling    | None, all points assigned    | Labels outliers −1               | None, all points assigned    |
| Shape flexibility | Spherical                    | Arbitrary                        | Arbitrary                    |
| Main tuning lever | K                            | eps, min_samples                 | bandwidth                    |
| Best suited when  | K is known, compact clusters | Data has clear density contrasts | Data has clear density modes |

Why KMeans Works Best Here#

Patient data in high-dimensional, mixed-feature space tends to be uniformly dense.
There are no sharp density drop-offs between patient groups. DBSCAN and MeanShift struggle in this setting because:

  • DBSCAN requires density contrasts that simply don’t exist in this tabular data

  • MeanShift’s automatic bandwidth estimation yields very few, coarse clusters

KMeans provides stable, interpretable segments despite its spherical-cluster assumption.

Limitations#

  • KMeans on one-hot data: Euclidean distance on binary features is pragmatic but imperfect; K-Prototypes handles mixed types more naturally

  • K selection is ambiguous: different K values yield different but potentially valid segmentations — always validate with domain experts

  • Geographic features excluded: including location could reveal regional patterns

  • Synthetic data: patterns may be clearer than in real clinical data

Next Steps#

  • Validate clusters with clinical domain experts

  • Try K-Prototypes (native mixed-type support) or hierarchical clustering with Gower distance

  • Apply UMAP for non-linear dimensionality reduction before clustering

  • Use cluster labels as features for downstream prediction tasks
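The Gower-distance route can be sketched by hand: numeric columns contribute range-normalized absolute differences, categorical columns a 0/1 mismatch, and the averaged distance matrix feeds SciPy's hierarchical clustering (illustrative implementation with a made-up four-patient example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def gower_distance(num, cat):
    """Gower distance for numeric (n x p) plus categorical (n x q) columns."""
    n = num.shape[0]
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0                      # guard constant columns
    D = np.zeros((n, n))
    for i in range(n):
        d_num = np.abs(num - num[i]) / ranges      # range-scaled numeric diffs
        d_cat = (cat != cat[i]).astype(float)      # 0/1 categorical mismatch
        D[i] = np.hstack([d_num, d_cat]).mean(axis=1)
    return D

# Made-up patients: [age, BMI] numeric, [insurance] categorical
num = np.array([[25, 22.0], [27, 24.0], [70, 31.0], [68, 33.0]])
cat = np.array([["Private"], ["Private"], ["Medicare"], ["Medicare"]])

D = gower_distance(num, cat)
Z = linkage(squareform(D, checks=False), method="average")
labels_hc = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels_hc)
```

The two young Private-insured patients and the two older Medicare patients land in separate clusters, without any one-hot encoding or Euclidean-distance compromise.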