Data Transformation

Data Transformation#

This notebook focuses on transforming cleaned data into analysis-ready features.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler

patients_cleaned = pd.read_csv("data/patients_cleaned.csv")

patients_cleaned.head()

	patient_id	age	admission_date	discharge_date	gender	weight	height	blood_pressure	admission_unit	albumin_g_dl
0	1	39.0	2023-01-05	2023-01-10	f	78.0	184.0	130.0	Surgery	1.00
1	2	60.0	2023-02-11	2023-02-20	m	55.0	180.0	90.0	Cardiol.	4.26
2	4	50.0	2023-03-20	2023-03-25	f	70.0	179.0	110.0	Intensive Care U.	3.10
3	10	45.0	2023-07-05	2023-07-15	m	90.0	165.0	80.0	Neurology	3.10
4	12	90.0	2023-08-10	2023-08-20	female	45.0	170.0	60.0	Intensive Care U.	4.26

#patients_cleaned.info()

1 Checking categorical variables#

Categorical variables often contain multiple spellings or encodings for the same concept. We need to standardize these.
Here we focus on the gender column:
- First, we inspect unique values and their counts.
- Then we replace inconsistent entries with a standard value.
Casting to category dtype can reduce memory and make intent explicit.

# Show number of unique categories and their counts
patients_cleaned['gender'].value_counts(dropna=False)

gender
m         9
male      9
f         6
female    2
Name: count, dtype: int64

# Correct inconsistent entries: f -> female, m -> male
patients_cleaned['gender'] = patients_cleaned['gender'].replace({'f': 'female', 'm': 'male'})

patients_cleaned['gender'].value_counts(dropna=False)

gender
male      18
female     8
Name: count, dtype: int64
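
The introduction to this section mentions casting to category dtype, which the cells above do not do. A minimal sketch of that step follows, using a small made-up Series rather than the real column. The extra whitespace and case normalization is an assumption about what messy input might look like, not something observed in this dataset.

```python
import pandas as pd

# Illustrative values only; the real column is patients_cleaned['gender']
gender = pd.Series(["f", "m", "male", "female", " M ", "F"])

# Normalize whitespace and case first, then map short codes to full labels
gender_clean = (
    gender.str.strip()
          .str.lower()
          .replace({"f": "female", "m": "male"})
)

# Cast to category dtype with an explicit, fixed set of allowed labels
gender_dtype = pd.CategoricalDtype(categories=["female", "male"])
gender_cat = gender_clean.astype(gender_dtype)

print(gender_cat.value_counts())
```

With an explicit list of categories, any value outside that list becomes NaN after the cast, so a non-zero `gender_cat.isna().sum()` would flag spellings that were not mapped.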

# visualize gender distribution
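
The comment above has no accompanying code. One possible way to draw the gender distribution is sketched below with plain matplotlib. The stand-in Series is rebuilt from the counts printed earlier (18 male, 8 female) so that the block runs on its own; in the notebook, `patients_cleaned['gender']` would take its place.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data rebuilt from the value_counts output above
gender = pd.Series(["male"] * 18 + ["female"] * 8, name="gender")
counts = gender.value_counts()

# One bar per category, ordered by frequency
fig, ax = plt.subplots(figsize=(4, 4))
ax.bar(counts.index, counts.values)
ax.set_title("Gender Distribution")
ax.set_xlabel("Gender")
ax.set_ylabel("Number of Patients")
fig.tight_layout()
# In a notebook the figure renders automatically; in a script, call plt.show()
```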

Check admission_unit for unique values#

# Check unique values in admission_unit
patients_cleaned['admission_unit'].value_counts(dropna=False)

admission_unit
Intensive Care U.    5
Psychiatry           5
Cardiol.             3
Surgery              2
Neurology            2
Orthopedics          2
Pediatrics           2
Emergency Room       2
General Medicine     2
Oncology             1
Name: count, dtype: int64

Rename inconsistent entries#

# Standardize admission_unit entries
patients_cleaned['admission_unit'] = patients_cleaned['admission_unit'].replace({'Emergency Room': 'ER', 'Intensive Care U.': 'ICU', 'Cardiol.': 'Cardiology'})
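
After a replacement like this, it can help to confirm that no unexpected spellings remain. The sketch below does so on a stand-in Series with one entry per unit from the value_counts output. The `expected_units` set is a hypothetical whitelist introduced here for illustration.

```python
import pandas as pd

# Stand-in Series with one example per unit listed in the value_counts output
units = pd.Series([
    "Intensive Care U.", "Psychiatry", "Cardiol.", "Surgery", "Neurology",
    "Orthopedics", "Pediatrics", "Emergency Room", "General Medicine", "Oncology",
])

# Same mapping as in the notebook cell
units = units.replace(
    {"Emergency Room": "ER", "Intensive Care U.": "ICU", "Cardiol.": "Cardiology"}
)

# Hypothetical whitelist of accepted labels (an assumption, not part of the dataset)
expected_units = {
    "ICU", "Psychiatry", "Cardiology", "Surgery", "Neurology",
    "Orthopedics", "Pediatrics", "ER", "General Medicine", "Oncology",
}

# Any value outside the whitelist points to a spelling that still needs mapping
unexpected = set(units.dropna()) - expected_units
print(unexpected)
```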

# visualize admission_unit distribution

plt.figure(figsize=(8,4))
sns.countplot(data=patients_cleaned, x='admission_unit', order=patients_cleaned['admission_unit'].value_counts().index)
plt.title('Admission Unit Distribution')
plt.ylabel('Number of Patients')
plt.xlabel('Admission Unit')
plt.xticks(rotation=45)
plt.show()

[Figure: bar chart "Admission Unit Distribution" showing the number of patients per admission unit]

2 Encoding categorical variables for modeling#

Many models require numeric inputs. pd.get_dummies() creates one-hot encoded columns for categorical variables.
For high-cardinality categorical features you may want alternative encoding strategies (target encoding, embedding, hashing).
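
As one example of the alternatives named above, the hashing trick maps each category to a fixed number of columns regardless of how many distinct values exist. The sketch below uses scikit-learn's FeatureHasher on a few unit names. The choice of `n_features=8` is arbitrary and only for illustration; with so few columns, two different categories can collide in the same column.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Illustrative subset of unit names
units = pd.Series(["Surgery", "Cardiology", "ICU", "Neurology", "ICU"])

# input_type="string" expects each sample to be an iterable of strings,
# so every value is wrapped in a one-element list
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[u] for u in units])

# The result is a sparse matrix: one row per sample, n_features columns
hashed_df = pd.DataFrame(hashed.toarray(), index=units.index)
print(hashed_df.shape)
```

Identical input values always hash to identical rows, so no fitted vocabulary needs to be stored, at the cost of columns that are not directly interpretable.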

One-hot encoding for ‘gender’#

# One-hot encode gender
patients_cleaned = pd.get_dummies(patients_cleaned, columns=['gender'], drop_first=True)
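
To see what `drop_first=True` does, the sketch below encodes a small stand-in column with and without it. With two categories, a single indicator column carries all the information, so dropping the first one avoids a redundant column.

```python
import pandas as pd

# Stand-in column; the notebook applies the same call to patients_cleaned
df = pd.DataFrame({"gender": ["female", "male", "female", "male"]})

# Without drop_first: one indicator column per category
full = pd.get_dummies(df, columns=["gender"])

# With drop_first: the first category in sorted order ('female') is dropped
reduced = pd.get_dummies(df, columns=["gender"], drop_first=True)

print(list(full.columns))     # ['gender_female', 'gender_male']
print(list(reduced.columns))  # ['gender_male']
```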

Label Encoding for ‘admission_unit’#

For admission_unit with multiple categories, we use label encoding to convert categories to integer codes.
This is simple but imposes an ordinal relationship. For non-ordinal categories, one-hot encoding is often preferred.
Here we demonstrate label encoding for variety.
First convert to ‘category’ dtype, then use .cat.codes to get integer codes.
Note: In practice, sklearn’s OrdinalEncoder gives more control when encoding input features; LabelEncoder is intended for target labels.
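
The scikit-learn route mentioned in the note can be sketched as follows, assuming a small stand-in frame with only four distinct units. Because fewer categories are present here than in the full dataset, the integer codes differ from those in the notebook output.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in frame with a subset of the unit names
df = pd.DataFrame(
    {"admission_unit": ["Surgery", "Cardiology", "ICU", "Neurology", "ICU"]}
)

# OrdinalEncoder expects 2D input, hence the double brackets
encoder = OrdinalEncoder()
codes = encoder.fit_transform(df[["admission_unit"]])
df["admission_unit_encoded"] = codes.ravel().astype(int)

# categories_ records the learned mapping (array position = integer code)
print(encoder.categories_[0])
print(df)
```

Unlike `.cat.codes`, the fitted encoder can be reused on new data with `encoder.transform`, which keeps the mapping consistent between training and later use.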

# Convert admission_unit to 'category' dtype
patients_cleaned['admission_unit_encoded'] = patients_cleaned['admission_unit'].astype('category')

# Label encode admission_unit
patients_cleaned['admission_unit_encoded'] = patients_cleaned['admission_unit_encoded'].cat.codes

patients_cleaned.head()

	patient_id	age	admission_date	discharge_date	weight	height	blood_pressure	admission_unit	albumin_g_dl	gender_male	admission_unit_encoded
0	1	39.0	2023-01-05	2023-01-10	78.0	184.0	130.0	Surgery	1.00	False	9
1	2	60.0	2023-02-11	2023-02-20	55.0	180.0	90.0	Cardiology	4.26	True	0
2	4	50.0	2023-03-20	2023-03-25	70.0	179.0	110.0	ICU	3.10	False	3
3	10	45.0	2023-07-05	2023-07-15	90.0	165.0	80.0	Neurology	3.10	True	4
4	12	90.0	2023-08-10	2023-08-20	45.0	170.0	60.0	ICU	4.26	False	3

3 Feature engineering - creating new features from existing ones#

BMI (Body Mass Index) is a common clinical feature derived from weight and height.
$\text{BMI} = \dfrac{\text{weight (kg)}}{(\text{height (m)})^2}$
In our dataset height is in cm, so it needs to be converted before calculation.
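
As a worked example of the formula, the sketch below computes BMI for the first row of the dataset (78 kg, 184 cm). The helper name `bmi` is introduced here only for illustration.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index from weight in kg and height in cm."""
    height_m = height_cm / 100  # convert cm to m
    return weight_kg / height_m ** 2


# First row of the dataset: 78 kg, 184 cm -> 78 / 1.84**2
print(round(bmi(78.0, 184.0), 2))  # 23.04
```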

# Compute BMI (height in cm -> convert to meters)
patients_cleaned['BMI'] = patients_cleaned['weight'] / (patients_cleaned['height']/100)**2

# Inspect BMI distribution and missingness
patients_cleaned['BMI'].describe()

count    26.000000
mean     25.087716
std       6.945768
min      14.527376
25%      21.232993
50%      22.505044
75%      29.602852
max      40.562466
Name: BMI, dtype: float64

Save transformed data#

# Save the transformed dataset
patients_cleaned.to_csv("data/patients_transformed.csv", index=False)

4 Scaling and standardization#

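Many ML algorithms, especially distance-based and gradient-based ones, are sensitive to feature scale, so features should be on similar scales.
- MinMaxScaler rescales each feature to the range [0, 1].
- StandardScaler centers each feature to mean 0 and scales it to standard deviation 1.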

We demonstrate both for different use cases.
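
To make the two transformations concrete, the sketch below computes them by hand and compares the results with scikit-learn. It uses only the five rows shown by head(), so the scaled values differ from those obtained on the full dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Ages and blood pressures from the five rows shown by head()
age = np.array([39.0, 60.0, 50.0, 45.0, 90.0]).reshape(-1, 1)
bp = np.array([130.0, 90.0, 110.0, 80.0, 60.0]).reshape(-1, 1)

# Min-max scaling: (x - min) / (max - min)
age_manual = (age - age.min()) / (age.max() - age.min())

# Standardization: (x - mean) / std, with the population std (ddof=0)
# that StandardScaler uses
bp_manual = (bp - bp.mean()) / bp.std(ddof=0)

print(np.allclose(age_manual, MinMaxScaler().fit_transform(age)))   # True
print(np.allclose(bp_manual, StandardScaler().fit_transform(bp)))   # True
```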

Method 1: Min-Max Scaling#

# Min-max scale 'age' to the range [0, 1]
minmax = MinMaxScaler()
patients_cleaned['age_minmax'] = minmax.fit_transform(patients_cleaned[['age']])

Method 2: Standardization#

# Standardize 'blood_pressure' to mean 0 and standard deviation 1
std = StandardScaler()
patients_cleaned['bp_std'] = std.fit_transform(patients_cleaned[['blood_pressure']])

patients_cleaned.head()

	patient_id	age	admission_date	discharge_date	weight	height	blood_pressure	admission_unit	albumin_g_dl	gender_male	admission_unit_encoded	BMI	age_minmax	bp_std
0	1	39.0	2023-01-05	2023-01-10	78.0	184.0	130.0	Surgery	1.00	False	9	23.038752	0.105263	1.274066
1	2	60.0	2023-02-11	2023-02-20	55.0	180.0	90.0	Cardiology	4.26	True	0	16.975309	0.473684	-0.436090
2	4	50.0	2023-03-20	2023-03-25	70.0	179.0	110.0	ICU	3.10	False	3	21.847009	0.298246	0.418988
3	10	45.0	2023-07-05	2023-07-15	90.0	165.0	80.0	Neurology	3.10	True	4	33.057851	0.210526	-0.863629
4	12	90.0	2023-08-10	2023-08-20	45.0	170.0	60.0	ICU	4.26	False	3	15.570934	1.000000	-1.718707

Further preprocessing for modeling#

Before modeling, further preprocessing steps typically include:
- scaling all numeric features
- dropping any remaining irrelevant columns
- handling any remaining missing values
- dropping the identifier column patient_id
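
The steps listed above could look roughly like the sketch below. It runs on a stand-in frame based on the first rows of the dataset, with one weight set to NaN purely for illustration. Treating admission_date as irrelevant and filling gaps with the median are assumptions made here, not choices taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in frame; one weight is set to NaN to illustrate missing-value handling
df = pd.DataFrame({
    "patient_id": [1, 2, 4, 10, 12],
    "age": [39.0, 60.0, 50.0, 45.0, 90.0],
    "weight": [78.0, 55.0, np.nan, 90.0, 45.0],
    "blood_pressure": [130.0, 90.0, 110.0, 80.0, 60.0],
    "admission_date": ["2023-01-05", "2023-02-11", "2023-03-20",
                       "2023-07-05", "2023-08-10"],
})

# Drop the identifier and a column assumed to be irrelevant for the model
features = df.drop(columns=["patient_id", "admission_date"])

# Fill remaining missing values with the column median (one possible strategy)
features = features.fillna(features.median())

# Scale all numeric features to mean 0 and unit variance
numeric_cols = features.select_dtypes(include="number").columns
features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])

print(features.round(3))
```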

Exercise — Data Transformation#

1. Standardize the blood_pressure column (StandardScaler is already imported) and store the result as bp_standardized.
2. Create a new feature length_of_stay as the difference in days between discharge_date and admission_date. (Hint: both columns need to be of datetime dtype.)
3. Plot a histogram of the computed BMI.

# 1. Standardize blood_pressure

# 2. Create length_of_stay feature

# 3. Plot histogram of BMI
