🐼 Introduction to pandas

pandas is an open-source Python library for working with and analyzing tabular data (similar to Excel spreadsheets) and time series. It is extremely useful in medical research because it makes it easy to handle patient data, lab results, ICU data, and many other clinical datasets.

pandas is built around two main data structures:

  • Series → A one-dimensional array with labels (like a single column in a table).

  • DataFrame → A two-dimensional tabular structure with labeled rows and columns (like a full table).
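
As a minimal sketch of the two structures, the snippet below builds a Series and a DataFrame by hand. The variable names (`ages`, `vitals`) and the values are illustrative assumptions, not part of any real dataset.

```python
import pandas as pd

# A Series: one-dimensional values with an index (labels)
ages = pd.Series([65, 42, 58], index=["Anna", "Ben", "Clara"], name="age")

# A DataFrame: a table whose columns share one row index
vitals = pd.DataFrame(
    {"age": [65, 42, 58], "blood_pressure": [120, 145, 130]},
    index=["Anna", "Ben", "Clara"],
)

print(ages["Ben"])                    # look up a value by its label
print(type(vitals["age"]).__name__)   # a single DataFrame column is a Series
print(vitals.shape)                   # (number of rows, number of columns)
```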


Why pandas?

  • Efficient handling of messy real-world data (missing values, duplicates, different formats).

  • High compatibility with other Python libraries (e.g., matplotlib, seaborn, scikit-learn).

  • Widely considered the de facto standard for tabular data in Python.
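
To make the first point concrete, here is a small sketch of removing a duplicated row and handling a missing value. The `labs` table and its `glucose` column are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical messy lab table: one duplicated row and one missing value
labs = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "glucose": [5.4, 5.4, float("nan"), 7.1],
})

deduplicated = labs.drop_duplicates()                # drop fully identical rows
n_missing = deduplicated["glucose"].isna().sum()     # count missing values
complete = deduplicated.dropna(subset=["glucose"])   # keep rows that have a glucose value

print(len(labs), len(deduplicated), n_missing, len(complete))
```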

👉 A great starting point is the official "10 minutes to pandas" tutorial.


Alternatives to pandas

While pandas is the most common tool, there are alternatives:

  • Dask – for parallel processing of very large datasets (bigger than your computer’s memory). Often used in big data and machine learning pipelines.

  • Polars – a modern, very fast DataFrame library that offers many pandas-like operations through its own API.

# Import pandas
import pandas as pd

Creating a DataFrame with a dictionary

We usually work with DataFrames in pandas. A DataFrame can be created from many sources:

  • Python dictionaries

  • Excel or CSV files

  • Databases

  • JSON or APIs
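
For the non-dictionary sources, a rough sketch might look like the following. The CSV text is an in-memory stand-in for a file on disk, and the `records` list imitates the kind of payload a JSON API might return; both are assumptions made for illustration.

```python
import io
import pandas as pd

# In-memory text standing in for a CSV file (hypothetical content)
csv_text = "patient_id,name,age\n1,Anna,65\n2,Ben,42\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# A list of dictionaries, as many JSON APIs return
records = [{"patient_id": 1, "name": "Anna"}, {"patient_id": 2, "name": "Ben"}]
from_records = pd.DataFrame(records)

print(from_csv.shape)                # (2, 3)
print(list(from_records.columns))    # ['patient_id', 'name']
```

With a real file, `pd.read_csv` accepts a path in place of the `StringIO` object.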

Here we create a simple patient dataset as a dictionary and load it into a DataFrame.

# Example: Patient data
data_dict = {
    "patient_id": [1, 2, 3],
    "name": ["Anna", "Ben", "Clara"],
    "age": [65, 42, 58],
    "gender": ["F", "M", "F"],
    "admission_date": ["2024-12-01", "2024-12-15", "2024-12-31"],
    "blood_pressure": [120, 145, 130]
}

patients = pd.DataFrame(data=data_dict)

patients
   patient_id   name  age  gender  admission_date  blood_pressure
0           1   Anna   65       F      2024-12-01             120
1           2    Ben   42       M      2024-12-15             145
2           3  Clara   58       F      2024-12-31             130

Exploring a DataFrame

pandas offers many methods to explore and understand your dataset.

  • .info() → metadata: number of rows and columns, data type of each column, non-null count per column (which reveals missing values), and memory usage

  • .head(n) → shows the first n rows (default = 5)

  • .tail(n) → shows the last n rows (default = 5)

  • .describe() → summary statistics (count, mean, standard deviation, min, quartiles, max) for numeric columns

# Show metadata
patients.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   patient_id      3 non-null      int64 
 1   name            3 non-null      object
 2   age             3 non-null      int64 
 3   gender          3 non-null      object
 4   admission_date  3 non-null      object
 5   blood_pressure  3 non-null      int64 
dtypes: int64(3), object(3)
memory usage: 276.0+ bytes
# Show first 2 rows
patients.head(2)
   patient_id   name  age  gender  admission_date  blood_pressure
0           1   Anna   65       F      2024-12-01             120
1           2    Ben   42       M      2024-12-15             145
# Show last 2 rows
patients.tail(2)
   patient_id   name  age  gender  admission_date  blood_pressure
1           2    Ben   42       M      2024-12-15             145
2           3  Clara   58       F      2024-12-31             130
# Summary statistics (transposed so that each column becomes a row)
patients.describe().transpose()
                count        mean        std    min    25%    50%    75%    max
patient_id        3.0    2.000000   1.000000    1.0    1.5    2.0    2.5    3.0
age               3.0   55.000000  11.789826   42.0   50.0   58.0   61.5   65.0
blood_pressure    3.0  131.666667  12.583057  120.0  125.0  130.0  137.5  145.0
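
The .info() output above lists admission_date as object, meaning the dates are stored as plain strings. One possible follow-up, sketched here on a reduced copy of the patient table, is to parse them with `pd.to_datetime`; the variable name `admission_span` is an assumption introduced for this example.

```python
import pandas as pd

patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "admission_date": ["2024-12-01", "2024-12-15", "2024-12-31"],
})

# Parse the ISO-formatted strings into real timestamps
patients["admission_date"] = pd.to_datetime(patients["admission_date"])

print(patients["admission_date"].dtype)   # a datetime64 dtype instead of object

# Date arithmetic now works: time between the first and last admission
admission_span = patients["admission_date"].max() - patients["admission_date"].min()
print(admission_span.days)                # 30
```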

Selecting Columns and Rows

You can select columns and rows in different ways.

# Column names
patients.columns
Index(['patient_id', 'name', 'age', 'gender', 'admission_date',
       'blood_pressure'],
      dtype='object')
# Row index
patients.index
RangeIndex(start=0, stop=3, step=1)

By default, pandas assigns a numeric index (0, 1, 2…). We can also use an existing column (like patient_id) as the index.

patients_new = patients.set_index("patient_id")
patients_new.head()
             name  age  gender  admission_date  blood_pressure
patient_id
1            Anna   65       F      2024-12-01             120
2             Ben   42       M      2024-12-15             145
3           Clara   58       F      2024-12-31             130
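
One detail worth checking for yourself: `set_index` returns a new DataFrame and leaves the original untouched, and `reset_index` moves the index back into a regular column. The sketch below uses a reduced two-column copy of the patient table; `indexed` and `restored` are names chosen for this example.

```python
import pandas as pd

patients = pd.DataFrame({"patient_id": [1, 2, 3], "name": ["Anna", "Ben", "Clara"]})

indexed = patients.set_index("patient_id")   # returns a new DataFrame
print("patient_id" in patients.columns)      # True: the original is unchanged
print("patient_id" in indexed.columns)       # False: it is now the index

restored = indexed.reset_index()             # move the index back into a column
print(list(restored.columns))                # ['patient_id', 'name']
```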

Selecting Columns

We can select:

  • A single column → returns a Series

  • Multiple columns → returns a DataFrame

# One column
patients_new["age"]
patient_id
1    65
2    42
3    58
Name: age, dtype: int64
# Multiple columns
patients_new[["age", "blood_pressure"]]
            age  blood_pressure
patient_id
1            65             120
2            42             145
3            58             130
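
The return type depends on the brackets, not on the number of columns selected. As a small sketch on a reduced copy of the table, a list containing a single name still yields a DataFrame:

```python
import pandas as pd

patients = pd.DataFrame({"age": [65, 42, 58], "blood_pressure": [120, 145, 130]})

as_series = patients["age"]     # single brackets: Series
as_frame = patients[["age"]]    # list with one name: one-column DataFrame

print(type(as_series).__name__, as_series.shape)   # Series (3,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (3, 1)
```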

Row Selection (Label-based with loc)

With .loc[], we can select rows by their labels (patient_id in this case).

# Select patient with ID 1
patients_new.loc[1]
name                    Anna
age                       65
gender                     F
admission_date    2024-12-01
blood_pressure           120
Name: 1, dtype: object
# Select multiple patients
patients_new.loc[[1, 3]]
             name  age  gender  admission_date  blood_pressure
patient_id
1            Anna   65       F      2024-12-01             120
3           Clara   58       F      2024-12-31             130
# Select rows + specific columns
patients_new.loc[[1, 2], ["age", "blood_pressure"]]
            age  blood_pressure
patient_id
1            65             120
2            42             145
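
.loc[] also accepts label slices and single labels. Unlike ordinary Python slicing, a label slice includes both endpoints. The sketch below rebuilds a reduced version of patients_new to stay self-contained.

```python
import pandas as pd

patients_new = pd.DataFrame(
    {"age": [65, 42, 58], "blood_pressure": [120, 145, 130]},
    index=pd.Index([1, 2, 3], name="patient_id"),
)

# Label slices include BOTH endpoints: patients 1, 2 and 3
print(patients_new.loc[1:3, "age"].tolist())    # [65, 42, 58]

# One row label plus one column label returns a single value
print(patients_new.loc[2, "blood_pressure"])    # 145
```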

Row Selection (Integer-based with iloc)

With .iloc[], we select rows and columns by their integer positions.

# First row (position 0)
patients_new.iloc[0]
name                    Anna
age                       65
gender                     F
admission_date    2024-12-01
blood_pressure           120
Name: 1, dtype: object
# First and third row, first two columns
patients_new.iloc[[0, 2], :2]
             name  age
patient_id
1            Anna   65
3           Clara   58
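
Because patient_id starts at 1, labels and positions no longer coincide, which makes the difference between .loc[] and .iloc[] visible. A minimal sketch on a reduced copy of patients_new:

```python
import pandas as pd

patients_new = pd.DataFrame(
    {"name": ["Anna", "Ben", "Clara"], "age": [65, 42, 58]},
    index=pd.Index([1, 2, 3], name="patient_id"),
)

# Position slices exclude the end: positions 0 and 1 only
print(patients_new.iloc[0:2].index.tolist())    # [1, 2]

# Negative positions count from the end, as with Python lists
print(patients_new.iloc[-1]["name"])            # Clara

# Label 0 does not exist here, so .loc[0] fails while .iloc[0] works
try:
    patients_new.loc[0]
except KeyError:
    print("no row labelled 0")
```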

Boolean Indexing (Filtering Rows)

Boolean indexing allows us to filter rows based on conditions.

# All female patients
patients_new[patients_new["gender"] == "F"]
             name  age  gender  admission_date  blood_pressure
patient_id
1            Anna   65       F      2024-12-01             120
3           Clara   58       F      2024-12-31             130
# Patients older than 50
patients_new[patients_new["age"] > 50]
             name  age  gender  admission_date  blood_pressure
patient_id
1            Anna   65       F      2024-12-01             120
3           Clara   58       F      2024-12-31             130
# Combine conditions: female patients older than 50
patients_new[(patients_new["gender"] == "F") & (patients_new["age"] > 50)]
             name  age  gender  admission_date  blood_pressure
patient_id
1            Anna   65       F      2024-12-01             120
3           Clara   58       F      2024-12-31             130
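
Besides `&` (and), conditions can be combined with `|` (or) and negated with `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The following sketch, on a reduced copy of patients_new, also shows `isin` for matching against a list of values; the variable names are chosen for this example.

```python
import pandas as pd

patients_new = pd.DataFrame(
    {
        "gender": ["F", "M", "F"],
        "age": [65, 42, 58],
        "blood_pressure": [120, 145, 130],
    },
    index=pd.Index([1, 2, 3], name="patient_id"),
)

# OR: younger than 50, or blood pressure of at least 130
either = patients_new[(patients_new["age"] < 50) | (patients_new["blood_pressure"] >= 130)]
print(either.index.tolist())        # [2, 3]

# NOT: everyone who is not female
not_female = patients_new[~(patients_new["gender"] == "F")]
print(not_female.index.tolist())    # [2]

# isin: match against a list of allowed values
selected = patients_new[patients_new["age"].isin([42, 58])]
print(selected.index.tolist())      # [2, 3]
```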

Exercise

  1. Select all patients older than 50.

  2. Show only the name and blood_pressure columns for patients with blood pressure > 130.

  3. Use .iloc to return the first two rows and the last two columns.