Data Handling and Preprocessing

This session provides an introduction to data preprocessing in Python using the pandas library, tailored to medical research and clinical datasets. Participants will work with synthetic patient data that includes demographics, admission and discharge information, vital signs, and lab values.

Learning Objectives

  • Understand the role of data preprocessing in clinical research and machine learning.

  • Get familiar with pandas as the core library for handling tabular data in Python.

  • Practice working with structured medical datasets containing typical data quality issues.

Topics Covered

Introduction to pandas

  • DataFrames and Series

  • Basic exploration and manipulation of tabular data
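
As a minimal sketch of these two topics, the example below builds a small DataFrame and explores it. The table and its column names (`patient_id`, `age`, `sex`, `heart_rate`) are hypothetical stand-ins, not the session's actual synthetic dataset.

```python
import pandas as pd

# Hypothetical synthetic patient table; names and values are illustrative only.
patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "age": [54, 67, 41, 73],
        "sex": ["F", "M", "F", "M"],
        "heart_rate": [72, 88, 95, 64],
    }
)

# The whole table is a DataFrame; a single column is a Series.
ages = patients["age"]

# Basic exploration: size, first rows, summary statistics of numeric columns.
n_rows, n_cols = patients.shape
first_rows = patients.head(2)
summary = patients.describe()

# Basic manipulation: filter rows, select columns, aggregate by group.
over_60 = patients.loc[patients["age"] > 60, ["patient_id", "age"]]
mean_hr_by_sex = patients.groupby("sex")["heart_rate"].mean()

print(over_60)
print(mean_hr_by_sex)
```

In a notebook, `patients.info()` is also a quick way to check column types and non-null counts before any cleaning.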

Data Cleaning

  • Merging patient and lab datasets

  • Correcting data types

  • Detecting and handling implausible values

  • Handling missing values

  • Removing duplicates
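
The cleaning steps listed above can be sketched as follows. The two tables, their column names, and the plausibility range for heart rate are assumptions made for illustration; they are not taken from the session's dataset, and the range is not a clinical recommendation. Median imputation is used here only as one simple option for handling missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical patient and lab tables with typical data quality issues.
patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 2, 3],  # patient 2 appears twice
        "admission_date": ["2024-01-03", "2024-01-05", "2024-01-05", "2024-01-09"],
        "age": ["54", "67", "67", "41"],  # numbers stored as text
        "heart_rate": [72.0, 400.0, 400.0, np.nan],  # 400 is implausible
    }
)
labs = pd.DataFrame({"patient_id": [1, 2, 3], "creatinine": [0.9, 1.4, np.nan]})

# Removing duplicates: drop fully identical rows.
patients = patients.drop_duplicates().reset_index(drop=True)

# Correcting data types: parse dates and convert text to numbers.
patients["admission_date"] = pd.to_datetime(patients["admission_date"])
patients["age"] = pd.to_numeric(patients["age"], errors="coerce")

# Implausible values: keep values inside an assumed range (20 to 300 bpm),
# and set everything else to missing.
plausible = patients["heart_rate"].between(20, 300)
patients["heart_rate"] = patients["heart_rate"].where(plausible)

# Merging patient and lab data: keep every patient, check key uniqueness.
merged = patients.merge(labs, on="patient_id", how="left", validate="one_to_one")

# Handling missing values: count them, then impute one column with its median.
missing_counts = merged.isna().sum()
merged["creatinine"] = merged["creatinine"].fillna(merged["creatinine"].median())

print(missing_counts)
print(merged)
```

Duplicates are removed first in this sketch so that the merge can be validated as one-to-one and the median is not skewed by repeated rows; other orderings are possible depending on the dataset.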

Data Transformation

  • Encoding categorical variables

  • Feature engineering

  • Normalization and scaling
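
A minimal sketch of the three transformation topics is shown below, using plain pandas. The columns and the derived features (length of stay, BMI) are hypothetical examples, and one-hot encoding, min-max scaling, and z-score standardization are only some of the possible choices.

```python
import pandas as pd

# Hypothetical cleaned table; names and values are illustrative only.
df = pd.DataFrame(
    {
        "sex": ["F", "M", "F", "M"],
        "admission_date": pd.to_datetime(
            ["2024-01-03", "2024-01-05", "2024-01-09", "2024-01-10"]
        ),
        "discharge_date": pd.to_datetime(
            ["2024-01-07", "2024-01-06", "2024-01-19", "2024-01-12"]
        ),
        "weight_kg": [60.0, 80.0, 70.0, 90.0],
        "height_m": [1.60, 1.80, 1.75, 1.70],
    }
)

# Feature engineering: derive new variables from existing columns.
df["length_of_stay_days"] = (df["discharge_date"] - df["admission_date"]).dt.days
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Encoding categorical variables: one-hot encode "sex" into 0/1 columns.
encoded = pd.get_dummies(df, columns=["sex"], dtype=int)

# Normalization (min-max): rescale length of stay to the range [0, 1].
los = encoded["length_of_stay_days"]
encoded["los_minmax"] = (los - los.min()) / (los.max() - los.min())

# Scaling (z-score): center BMI at 0 with unit standard deviation.
# Note: pandas .std() uses the sample standard deviation (ddof=1) by default.
bmi = encoded["bmi"]
encoded["bmi_zscore"] = (bmi - bmi.mean()) / bmi.std()

print(encoded[["length_of_stay_days", "los_minmax", "bmi", "bmi_zscore", "sex_F", "sex_M"]])
```

When the data feed a machine learning model, the scaling parameters (minimum, maximum, mean, standard deviation) are usually computed on the training set only and then reused for the test set.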

The slides for this session can be downloaded as a PDF [here](Data Handling and Preprocessing.pdf).