🐼 Introduction to pandas#
pandas is an open-source Python library for working with and analyzing tabular data (similar to Excel spreadsheets) and time series. It is extremely useful in medical research because it makes it easy to handle patient data, lab results, ICU data, and many other clinical datasets.
pandas is built around two main data structures:
Series → A one-dimensional array with labels (like a single column in a table).
DataFrame → A two-dimensional tabular structure with labeled rows and columns (like a full table).
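As a minimal sketch of the two structures described above, the following builds one of each by hand. The names and values are illustrative only and are not taken from a real dataset.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
ages = pd.Series([65, 42, 58], index=["Anna", "Ben", "Clara"], name="age")

# A DataFrame: several columns that share one row index
df = pd.DataFrame(
    {"age": [65, 42, 58], "blood_pressure": [120, 145, 130]},
    index=["Anna", "Ben", "Clara"],
)

print(ages["Ben"])       # look up a value by its label
print(type(df["age"]))   # each DataFrame column is itself a Series
print(df.shape)          # (number of rows, number of columns)
```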
Why pandas?
Efficient handling of messy real-world data (missing values, duplicates, different formats).
High compatibility with other Python libraries (e.g., matplotlib, seaborn, scikit-learn).
Widely considered the de facto standard for tabular data in Python.
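To illustrate the point about messy data, here is a small sketch with a made-up lab table that contains one missing value and one duplicated row. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical lab results: row 2 is repeated, and one value is missing
labs = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "creatinine": [0.9, 1.1, 1.1, None],
})

missing_per_column = labs.isna().sum()   # count missing values per column
deduplicated = labs.drop_duplicates()    # drop rows that are exact repeats
complete = deduplicated.dropna()         # keep only rows without missing values

print(missing_per_column)
print(len(labs), len(deduplicated), len(complete))
```

Whether dropping incomplete rows is appropriate depends on the analysis; it is shown here only because it is the simplest option.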
👉 A great starting point is the official "10 minutes to pandas" tutorial in the pandas documentation.
Alternatives to pandas
While pandas is the most common tool, there are alternatives:
Dask – for parallel processing of very large datasets (bigger than your computer’s memory). Often used in big data and machine learning pipelines.
Polars – a modern, very fast data analysis library whose API supports many pandas-like operations.
# Import pandas
import pandas as pd
Creating a DataFrame with a dictionary#
We usually work with DataFrames in pandas. A DataFrame can be created from many sources:
Python dictionaries
Excel or CSV files
Databases
JSON or APIs
Here we create a simple patient dataset as a dictionary and load it into a DataFrame.
# Example: Patient data
data_dict = {
    "patient_id": [1, 2, 3],
    "name": ["Anna", "Ben", "Clara"],
    "age": [65, 42, 58],
    "gender": ["F", "M", "F"],
    "admission_date": ["2024-12-01", "2024-12-15", "2024-12-31"],
    "blood_pressure": [120, 145, 130],
}
patients = pd.DataFrame(data=data_dict)
patients
| | patient_id | name | age | gender | admission_date | blood_pressure |
|---|---|---|---|---|---|---|
| 0 | 1 | Anna | 65 | F | 2024-12-01 | 120 |
| 1 | 2 | Ben | 42 | M | 2024-12-15 | 145 |
| 2 | 3 | Clara | 58 | F | 2024-12-31 | 130 |
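The dictionary route is only one of the sources listed above. As a rough sketch of two others, the block below reads CSV text from an in-memory string (standing in for a file on disk) and builds a DataFrame from a list of records. The CSV content is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file path
csv_text = """patient_id,name,age
1,Anna,65
2,Ben,42
3,Clara,58
"""

# pd.read_csv accepts a path or any file-like object
patients_from_csv = pd.read_csv(io.StringIO(csv_text))

# A list of records (one dict per row) is another common input
patients_from_records = pd.DataFrame([
    {"patient_id": 1, "name": "Anna", "age": 65},
    {"patient_id": 2, "name": "Ben", "age": 42},
])

print(patients_from_csv.shape)
print(patients_from_records.shape)
```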
Exploring a DataFrame#
pandas offers many methods to explore and understand your dataset.
.info() → metadata: number of rows and columns, data types, and non-null counts (which reveal missing values)
.head(n) → shows the first n rows (default = 5)
.tail(n) → shows the last n rows (default = 5)
.describe() → summary statistics (count, mean, standard deviation, min, quartiles, max) for numeric columns
# Show metadata
patients.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 patient_id 3 non-null int64
1 name 3 non-null object
2 age 3 non-null int64
3 gender 3 non-null object
4 admission_date 3 non-null object
5 blood_pressure 3 non-null int64
dtypes: int64(3), object(3)
memory usage: 276.0+ bytes
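The output above lists admission_date as a generic (non-datetime) column, because the dates were entered as strings. One possible way to turn them into real dates is pd.to_datetime; the sketch below assumes the ISO format used in this example and rebuilds a reduced copy of the table so it can run on its own.

```python
import pandas as pd

patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "admission_date": ["2024-12-01", "2024-12-15", "2024-12-31"],
})

# Parse the ISO-formatted strings into datetime values
patients["admission_date"] = pd.to_datetime(patients["admission_date"])

# Date arithmetic now works, e.g. days between first and last admission
span = patients["admission_date"].max() - patients["admission_date"].min()
print(span.days)
```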
# Show first 2 rows
patients.head(2)
| | patient_id | name | age | gender | admission_date | blood_pressure |
|---|---|---|---|---|---|---|
| 0 | 1 | Anna | 65 | F | 2024-12-01 | 120 |
| 1 | 2 | Ben | 42 | M | 2024-12-15 | 145 |
# Show last 2 rows
patients.tail(2)
| | patient_id | name | age | gender | admission_date | blood_pressure |
|---|---|---|---|---|---|---|
| 1 | 2 | Ben | 42 | M | 2024-12-15 | 145 |
| 2 | 3 | Clara | 58 | F | 2024-12-31 | 130 |
# Summary statistics
patients.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| patient_id | 3.0 | 2.000000 | 1.000000 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 |
| age | 3.0 | 55.000000 | 11.789826 | 42.0 | 50.0 | 58.0 | 61.5 | 65.0 |
| blood_pressure | 3.0 | 131.666667 | 12.583057 | 120.0 | 125.0 | 130.0 | 137.5 | 145.0 |
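The summary above covers only the numeric columns. For a categorical column such as gender, counting how often each value occurs is often more informative. A small sketch, using a reduced copy of the example data so it runs on its own:

```python
import pandas as pd

patients = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "blood_pressure": [120, 145, 130],
})

gender_counts = patients["gender"].value_counts()   # frequency of each category
mean_bp = patients["blood_pressure"].mean()         # a single statistic for one column

print(gender_counts)
print(round(mean_bp, 2))
```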
Selecting Columns and Rows#
You can select columns and rows in different ways.
# Column names
patients.columns
Index(['patient_id', 'name', 'age', 'gender', 'admission_date',
'blood_pressure'],
dtype='object')
# Row index
patients.index
RangeIndex(start=0, stop=3, step=1)
By default, pandas assigns a numeric index (0, 1, 2…).
We can also use an existing column (like patient_id) as the index.
patients_new = patients.set_index("patient_id")
patients_new.head()
| patient_id | name | age | gender | admission_date | blood_pressure |
|---|---|---|---|---|---|
| 1 | Anna | 65 | F | 2024-12-01 | 120 |
| 2 | Ben | 42 | M | 2024-12-15 | 145 |
| 3 | Clara | 58 | F | 2024-12-31 | 130 |
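Note that set_index returns a new DataFrame and leaves the original untouched, which is why the result was stored under a new name. The sketch below, on a reduced copy of the data, shows this and how reset_index moves the index back into a regular column.

```python
import pandas as pd

patients = pd.DataFrame({"patient_id": [1, 2, 3], "name": ["Anna", "Ben", "Clara"]})

indexed = patients.set_index("patient_id")   # returns a new DataFrame
restored = indexed.reset_index()             # moves the index back into a column

print(list(patients.columns))    # the original still has patient_id as a column
print(list(indexed.columns))     # patient_id is now the index, not a column
print(restored.equals(patients))
```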
Selecting Columns#
We can select:
A single column → returns a Series
Multiple columns → returns a DataFrame
# One column
patients_new["age"]
patient_id
1 65
2 42
3 58
Name: age, dtype: int64
# Multiple columns
patients_new[["age", "blood_pressure"]]
| patient_id | age | blood_pressure |
|---|---|---|
| 1 | 65 | 120 |
| 2 | 42 | 145 |
| 3 | 58 | 130 |
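The difference between single and double brackets is easy to miss. A minimal sketch that checks the returned types, again on a reduced copy of the data:

```python
import pandas as pd

patients = pd.DataFrame({"age": [65, 42, 58], "blood_pressure": [120, 145, 130]})

as_series = patients["age"]     # single brackets -> Series
as_frame = patients[["age"]]    # double brackets -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Keeping the result as a DataFrame can be convenient when a later step expects two-dimensional input.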
Row Selection (Label-based with loc)#
With .loc[], we can select rows by their labels (patient_id in this case).
# Select patient with ID 1
patients_new.loc[1]
name Anna
age 65
gender F
admission_date 2024-12-01
blood_pressure 120
Name: 1, dtype: object
# Select multiple patients
patients_new.loc[[1, 3]]
| patient_id | name | age | gender | admission_date | blood_pressure |
|---|---|---|---|---|---|
| 1 | Anna | 65 | F | 2024-12-01 | 120 |
| 3 | Clara | 58 | F | 2024-12-31 | 130 |
# Select rows + specific columns
patients_new.loc[[1,2], ["age", "blood_pressure"]]
| patient_id | age | blood_pressure |
|---|---|---|
| 1 | 65 | 120 |
| 2 | 42 | 145 |
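Besides lists of labels, .loc[] also accepts label slices, and unlike ordinary Python slices these include both endpoints. The sketch below uses a reduced copy of the indexed table to show a row slice, a single-cell lookup, and a column slice.

```python
import pandas as pd

patients_new = pd.DataFrame(
    {"name": ["Anna", "Ben", "Clara"], "age": [65, 42, 58]},
    index=pd.Index([1, 2, 3], name="patient_id"),
)

first_two = patients_new.loc[1:2]                 # label slice: includes both 1 and 2
anna_age = patients_new.loc[1, "age"]             # one row label + one column -> scalar
name_to_age = patients_new.loc[:, "name":"age"]   # column labels can be sliced too

print(first_two)
print(anna_age)
```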
Row Selection (Integer-based with iloc)#
With .iloc[], we select rows and columns by their integer positions.
# First row (position 0)
patients_new.iloc[0]
name Anna
age 65
gender F
admission_date 2024-12-01
blood_pressure 120
Name: 1, dtype: object
# First and third row, first two columns
patients_new.iloc[[0, 2], :2]
| patient_id | name | age |
|---|---|---|
| 1 | Anna | 65 |
| 3 | Clara | 58 |
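The distinction between labels and positions becomes clearer when the index labels are not simply 0, 1, 2. The following sketch uses a hypothetical index of 10, 20, 30 to contrast .loc[] and .iloc[], and shows that position slices exclude the end position.

```python
import pandas as pd

# Hypothetical index whose labels differ from the row positions
df = pd.DataFrame({"age": [65, 42, 58]}, index=[10, 20, 30])

by_label = df.loc[10, "age"]    # row whose label is 10
by_position = df.iloc[0, 0]     # first row, first column
last_row = df.iloc[-1]          # negative positions count from the end
middle_on = df.iloc[1:3]        # position slice: positions 1 and 2, end excluded

print(by_label, by_position)
print(list(middle_on.index))
```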
Boolean Indexing (Filtering Rows)#
Boolean indexing allows us to filter rows based on conditions.
# All female patients
patients_new[patients_new["gender"] == "F"]
| patient_id | name | age | gender | admission_date | blood_pressure |
|---|---|---|---|---|---|
| 1 | Anna | 65 | F | 2024-12-01 | 120 |
| 3 | Clara | 58 | F | 2024-12-31 | 130 |
# Patients older than 50
patients_new[patients_new["age"] > 50]
| patient_id | name | age | gender | admission_date | blood_pressure |
|---|---|---|---|---|---|
| 1 | Anna | 65 | F | 2024-12-01 | 120 |
| 3 | Clara | 58 | F | 2024-12-31 | 130 |
# Combine conditions: female patients older than 50
patients_new[(patients_new["gender"] == "F") & (patients_new["age"] > 50)]
| patient_id | name | age | gender | admission_date | blood_pressure |
|---|---|---|---|---|---|
| 1 | Anna | 65 | F | 2024-12-01 | 120 |
| 3 | Clara | 58 | F | 2024-12-31 | 130 |
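Under the hood, each condition produces a boolean Series with one True/False value per row, and that mask is what selects the rows. The sketch below stores the mask in a variable and combines it with | (or) and ~ (not). Each comparison needs its own parentheses because &, | and ~ bind more tightly than == or <. The age threshold of 45 is arbitrary.

```python
import pandas as pd

patients_new = pd.DataFrame(
    {"gender": ["F", "M", "F"], "age": [65, 42, 58]},
    index=pd.Index([1, 2, 3], name="patient_id"),
)

is_female = patients_new["gender"] == "F"   # boolean Series, one value per row
print(is_female.tolist())

either = patients_new[is_female | (patients_new["age"] < 45)]   # OR: female or younger than 45
not_female = patients_new[~is_female]                            # NOT: everyone else

print(list(either.index))
print(list(not_female.index))
```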
Exercise#
Select all patients older than 50.
Show only the name and blood_pressure columns for patients with blood pressure > 130.
Use .iloc to return the first two rows and the last two columns.