🐼 Introduction to pandas


pandas is an open-source Python library for working with and analyzing tabular data (similar to Excel spreadsheets) and time series. It is well suited to medical research, where it simplifies handling patient records, lab results, ICU data, and many other clinical datasets.

pandas is built around two main data structures:

Series → A one-dimensional array with labels (like a single column in a table).
DataFrame → A two-dimensional tabular structure with labeled rows and columns (like a full table).
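To make the difference between the two structures concrete, here is a minimal sketch. The variable names (`ages`, `vitals`) and the values are illustrative assumptions, not part of any real dataset.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels (illustrative data)
ages = pd.Series([65, 42, 58], index=["Anna", "Ben", "Clara"], name="age")

# A DataFrame: several labeled columns that share the same row index
vitals = pd.DataFrame(
    {"age": [65, 42, 58], "blood_pressure": [120, 145, 130]},
    index=["Anna", "Ben", "Clara"],
)

print(ages["Ben"])          # look up a single value by its label
print(type(vitals["age"]))  # each DataFrame column is itself a Series
print(vitals.shape)         # (number of rows, number of columns)
```

Selecting one column of a DataFrame gives back a Series, which is why the two structures are usually introduced together.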

Why pandas?

Efficient handling of messy real-world data (missing values, duplicates, different formats).
High compatibility with other Python libraries (e.g., matplotlib, seaborn, scikit-learn).
Widely considered the de facto standard for tabular data in Python.

👉 A great starting point is the official "10 minutes to pandas" tutorial in the pandas documentation.

Alternatives to pandas

While pandas is the most common tool, there are alternatives:

Dask – for parallel processing of very large datasets (bigger than your computer’s memory). Often used in big data and machine learning pipelines.
Polars – a modern, very fast DataFrame library. It has its own API, but many of its operations will feel familiar to pandas users.

# Import pandas
import pandas as pd

Creating a DataFrame with a dictionary

We usually work with DataFrames in pandas. A DataFrame can be created from many sources:

Python dictionaries
Excel or CSV files
Databases
JSON or APIs

Here we create a simple patient dataset as a dictionary and load it into a DataFrame.

# Example: Patient data
data_dict = {
    "patient_id": [1, 2, 3],
    "name": ["Anna", "Ben", "Clara"],
    "age": [65, 42, 58],
    "gender": ["F", "M", "F"],
    "admission_date": ["2024-12-01", "2024-12-15", "2024-12-31"],
    "blood_pressure": [120, 145, 130]
}

patients = pd.DataFrame(data=data_dict)

patients

	patient_id	name	age	gender	admission_date	blood_pressure
0	1	Anna	65	F	2024-12-01	120
1	2	Ben	42	M	2024-12-15	145
2	3	Clara	58	F	2024-12-31	130
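Dictionaries are only one of the sources listed above. As a rough sketch of the CSV route, the same table can be read with `pd.read_csv`. The CSV text here is made up for illustration and passed in through `io.StringIO`; in practice you would usually pass a file path instead (a name like `patients.csv` would be an assumption about your own files).

```python
import io
import pandas as pd

# Illustrative CSV content standing in for a real file on disk
csv_text = """patient_id,name,age,gender,admission_date,blood_pressure
1,Anna,65,F,2024-12-01,120
2,Ben,42,M,2024-12-15,145
3,Clara,58,F,2024-12-31,130
"""

# read_csv accepts a file path or any file-like object
patients_from_csv = pd.read_csv(io.StringIO(csv_text))

print(patients_from_csv.shape)
print(patients_from_csv.columns.tolist())
```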

Exploring a DataFrame

pandas offers many methods to explore and understand your dataset.

.info() → metadata: number of rows and columns, data types, non-null counts per column (which reveal missing values), memory usage
.head(n) → shows the first n rows (default = 5)
.tail(n) → shows the last n rows (default = 5)
.describe() → summary statistics (count, mean, standard deviation, min, quartiles, max) for numeric columns

# Show metadata
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   patient_id      3 non-null      int64 
 1   name            3 non-null      object
 2   age             3 non-null      int64 
 3   gender          3 non-null      object
 4   admission_date  3 non-null      object
 5   blood_pressure  3 non-null      int64 
dtypes: int64(3), object(3)
memory usage: 276.0+ bytes
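The output above lists `admission_date` as `object`, meaning the dates are stored as plain strings. If you want to compute with them, one possible approach is to convert the column with `pd.to_datetime`. This is a sketch on a small stand-alone frame (the name `dates` is an assumption) so that it does not change `patients`.

```python
import pandas as pd

# Stand-alone copy of the date strings used in the patient example
dates = pd.DataFrame({"admission_date": ["2024-12-01", "2024-12-15", "2024-12-31"]})

# Parse the strings into real datetime values
dates["admission_date"] = pd.to_datetime(dates["admission_date"])
print(dates["admission_date"].dtype)

# Datetime columns support date arithmetic, e.g. the span between admissions
span = dates["admission_date"].max() - dates["admission_date"].min()
print(span.days)
```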

# Show first 2 rows
patients.head(2)

	patient_id	name	age	gender	admission_date	blood_pressure
0	1	Anna	65	F	2024-12-01	120
1	2	Ben	42	M	2024-12-15	145

# Show last 2 rows
patients.tail(2)

	patient_id	name	age	gender	admission_date	blood_pressure
1	2	Ben	42	M	2024-12-15	145
2	3	Clara	58	F	2024-12-31	130

# Summary statistics for numeric columns (transposed: one row per column)
patients.describe().transpose()

	count	mean	std	min	25%	50%	75%	max
patient_id	3.0	2.000000	1.000000	1.0	1.5	2.0	2.5	3.0
age	3.0	55.000000	11.789826	42.0	50.0	58.0	61.5	65.0
blood_pressure	3.0	131.666667	12.583057	120.0	125.0	130.0	137.5	145.0

Selecting Columns and Rows

You can select columns and rows in different ways.

# Column names
patients.columns

Index(['patient_id', 'name', 'age', 'gender', 'admission_date',
       'blood_pressure'],
      dtype='object')

# Row index
patients.index

RangeIndex(start=0, stop=3, step=1)

By default, pandas assigns a numeric index (0, 1, 2…). We can also use an existing column (like patient_id) as the index. Note that set_index() returns a new DataFrame and leaves the original patients unchanged.

patients_new = patients.set_index("patient_id")
patients_new.head()

	name	age	gender	admission_date	blood_pressure
patient_id
1	Anna	65	F	2024-12-01	120
2	Ben	42	M	2024-12-15	145
3	Clara	58	F	2024-12-31	130

Selecting Columns

We can select:

A single column → returns a Series
Multiple columns → returns a DataFrame

# One column
patients_new["age"]

patient_id
1    65
2    42
3    58
Name: age, dtype: int64

# Multiple columns
patients_new[["age", "blood_pressure"]]

	age	blood_pressure
patient_id
1	65	120
2	42	145
3	58	130

Row Selection (Label-based with loc)

With .loc[], we can select rows by their labels (patient_id in this case).

# Select patient with ID 1
patients_new.loc[1]

name                    Anna
age                       65
gender                     F
admission_date    2024-12-01
blood_pressure           120
Name: 1, dtype: object

# Select multiple patients
patients_new.loc[[1, 3]]

	name	age	gender	admission_date	blood_pressure
patient_id
1	Anna	65	F	2024-12-01	120
3	Clara	58	F	2024-12-31	130

# Select rows + specific columns
patients_new.loc[[1,2], ["age", "blood_pressure"]]

	age	blood_pressure
patient_id
1	65	120
2	42	145

Row Selection (Integer-based with iloc)

With .iloc[], we select rows and columns by their integer positions.

# First row (position 0)
patients_new.iloc[0]

name                    Anna
age                       65
gender                     F
admission_date    2024-12-01
blood_pressure           120
Name: 1, dtype: object

# First and third row, first two columns
patients_new.iloc[[0, 2], :2]

	name	age
patient_id
1	Anna	65
3	Clara	58
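Because the index here consists of integers (the patient IDs), `.loc` and `.iloc` are easy to confuse. The following sketch rebuilds a reduced version of `patients_new` (only two columns, as an assumption to keep it short) and contrasts the two accessors.

```python
import pandas as pd

# Reduced stand-in for patients_new with patient_id as the index
patients_new = pd.DataFrame(
    {"name": ["Anna", "Ben", "Clara"], "age": [65, 42, 58]},
    index=pd.Index([1, 2, 3], name="patient_id"),
)

by_label = patients_new.loc[1]      # row whose patient_id is 1
by_position = patients_new.iloc[1]  # second row (position 1)

print(by_label["name"], by_position["name"])

# Slices differ too: loc includes the end label, iloc excludes the end position
print(len(patients_new.loc[1:2]))   # labels 1 and 2
print(len(patients_new.iloc[1:2]))  # position 1 only
```

The same number therefore selects different rows depending on the accessor: `.loc[1]` returns Anna, while `.iloc[1]` returns Ben.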

Boolean Indexing (Filtering Rows)

Boolean indexing allows us to filter rows based on conditions.

# All female patients
patients_new[patients_new["gender"] == "F"]

	name	age	gender	admission_date	blood_pressure
patient_id
1	Anna	65	F	2024-12-01	120
3	Clara	58	F	2024-12-31	130

# Patients older than 50
patients_new[patients_new["age"] > 50]

	name	age	gender	admission_date	blood_pressure
patient_id
1	Anna	65	F	2024-12-01	120
3	Clara	58	F	2024-12-31	130

# Combine conditions with & (each condition needs its own parentheses): female patients older than 50
patients_new[(patients_new["gender"] == "F") & (patients_new["age"] > 50)]

	name	age	gender	admission_date	blood_pressure
patient_id
1	Anna	65	F	2024-12-01	120
3	Clara	58	F	2024-12-31	130
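Besides `&` (and), conditions can also be combined with `|` (or) and negated with `~` (not). A boolean mask can also be passed to `.loc` together with a list of columns. The sketch below rebuilds a reduced `patients_new` so that it runs on its own; the mask names (`is_female`, `over_60`) are illustrative assumptions.

```python
import pandas as pd

# Reduced stand-in for patients_new with patient_id as the index
patients_new = pd.DataFrame(
    {
        "name": ["Anna", "Ben", "Clara"],
        "age": [65, 42, 58],
        "gender": ["F", "M", "F"],
        "blood_pressure": [120, 145, 130],
    },
    index=pd.Index([1, 2, 3], name="patient_id"),
)

# Storing masks in variables keeps longer filters readable
is_female = patients_new["gender"] == "F"
over_60 = patients_new["age"] > 60

either = patients_new[is_female | over_60]   # female OR older than 60
not_female = patients_new[~is_female]        # NOT female

# Filter rows and pick columns in one step with .loc
names_over_60 = patients_new.loc[over_60, ["name", "age"]]

print(either.index.tolist())
print(not_female["name"].tolist())
print(names_over_60)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why pandas uses `&`, `|`, and `~` instead.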

Exercise

Select all patients older than 50.
Show only the name and blood_pressure columns for patients with blood pressure > 130.
Use .iloc to return the first two rows and the last two columns.