Welcome back! In Session 1 you set up JupyterLab, met the AI assistant bia-bob, and used it to explore a dataset and draft documents like a data analysis plan or a Jupyter notebook.
In Session 2 we keep the same workbench and assistant, and introduce a small toolbelt of Python packages for everyday data steward tasks: validating, profiling, cleaning, documenting, and (optionally) depositing data.
Throughout this session the rule is AI-first, but always understand and verify:
we let the assistant draft the code, then we read it, run it, and check the result.
Learning objectives
By the end of Session 2 you will be able to:
Recognise common data quality problems in an example raw incoming dataset.
Define and enforce a data schema (a “data contract”) with pandera.
Generate a shareable data quality report with fg-data-profiling (formerly named ydata-profiling).
Apply a few light cleaning steps and confirm they worked.
Describe and bundle your data into a documented, FAIR data package with frictionless.
Draft a DMP section from what you found.
(Optional) Deposit a package to a repository (Zenodo Sandbox) via its API.
The data steward pipeline
The notebooks in this session follow one realistic workflow. Imagine a research project hands you a raw data file. Your job is to turn it into something validated, documented, and ready to hand over or deposit. So, our pipeline may look like this:
EXPLORE → VALIDATE → REPORT & CLEAN → PACKAGE → DMP & DEPOSIT
| Notebook | Stage | Tool |
|---|---|---|
| 2_data-validation | Validate | pandera |
| 3_quality-report | Report & clean | fg-data-profiling |
| 4_metadata-packaging | Describe & package | frictionless |
| 5_dmp-and-deposit (optional) | DMP & deposit | assistant + Zenodo API |
Each tool is one item on the toolbelt. You don’t need to memorise them - you need to know what each one is for and how to ask the assistant to use it correctly.
Initialize the AI assistant
This is the same initialization you saved at the end of Session 1. It reads your endpoint URL,
API key, and the Data Steward system prompt from your secrets file.
Reminder: JupyterLab must have been started with your secrets file:
uv run --env-file path/to/secrets jupyter lab
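If the initialization fails, a likely cause is that JupyterLab was started without the secrets file. The following is a minimal sketch of a pre-flight check, assuming only the two variable names used in the initialization cell (`ENDPOINT_URL` and `SYSTEM_PROMPT_DATA_STEWARD`). The helper `missing_env_vars` is a hypothetical name introduced for illustration. The API key is not checked because its variable name is not stated here.

```python
import os

# Variable names taken from the initialization cell in this notebook
REQUIRED_VARS = ["ENDPOINT_URL", "SYSTEM_PROMPT_DATA_STEWARD"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All expected environment variables are set.")
```

If a name is reported as missing, restart JupyterLab with the `--env-file` option shown above and run the check again.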
# Import os to read env vars
import os
# Import bob and helper functions
from bia_bob import bob, available_models, fix, doc
# Initialize assistant, reads API key from env var automatically
bob.initialize(
    # read endpoint URL from env var
    endpoint=os.getenv('ENDPOINT_URL'),
    # select an available model for your purpose
    model="mini",
    # read the Data Steward system prompt from env var
    system_prompt=os.getenv('SYSTEM_PROMPT_DATA_STEWARD'),
)
Let’s confirm the assistant is available and in its Data Steward role:
%bob Briefly, in two sentences: what is your role, and what can you help me with?
Our working data: a raw incoming dataset
For this session we work with ../data/field_samples_raw.csv - a small dataset of biological field samples that a project has handed over. It is raw: nobody has validated or cleaned it yet.
A data dictionary describing what the clean data should look like is provided in ../data/data_dictionary.md. Open it now in JupyterLab (right-click > Open With > Markdown Preview) and keep it handy: it is our reference for “what good looks like”.
Let’s take a first look with pandas. If pandas is not already available in your venv, you can use uv to install (add) it:
#!uv add pandas
If you do not understand what the following code does, you may ask your assistant ;)
import pandas as pd
df = pd.read_csv("../data/field_samples_raw.csv")
print("Shape (rows, columns):", df.shape)
df.head(10)
Shape (rows, columns): (182, 9)
Take a moment to look. Even from the first rows you can probably already spot trouble, e.g., in the species, sex, body_mass_g, or date_collected columns.
Let’s ask the assistant for a quick first impression, using prompt augmentation via {}. Always keep in mind, however, how large the augmented variable is relative to the context size of the model behind your assistant!
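One way to act on this is to measure the text before sending it. The sketch below assumes that the placeholder inserts roughly the text form of the variable (`str(df)`), which is an assumption about bia-bob and not confirmed here. It also assumes about four characters per token, a rough rule of thumb that varies by model. The helper `prompt_size` and the tiny example frame are made up for illustration.

```python
import pandas as pd


def prompt_size(obj, chars_per_token=4):
    """Return (characters, rough token estimate) for the text form of obj."""
    text = str(obj)
    return len(text), len(text) // chars_per_token


# Tiny made-up frame for illustration only; in the notebook you would pass df
example = pd.DataFrame({"species": ["a", "b", "c"], "body_mass_g": [1.0, 2.0, 3.0]})

n_chars, n_tokens = prompt_size(example)
print(f"{n_chars} characters, roughly {n_tokens} tokens")

# A smaller random sample is one way to stay within a limited context window
small = example.sample(n=2, random_state=0)
print(prompt_size(small))
```

Note that pandas shortens the printed form of long tables, so `str(df)` can be much smaller than the full data. Passing `df.to_csv(index=False)` to the helper gives the size of the complete table.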
Notice we ask for a short text answer, no code - we just want orientation:
%%bob
Here is a raw dataset I just received: {df}
Give me a short, plain-language list of the most obvious data quality problems you notice.
Just a short bulleted text answer. No code, no cleaning yet.
Keep the assistant’s list in mind, but remember to verify it against what you see. In the next notebooks we will systematically catch and fix these issues with the toolbelt, instead of eyeballing them.
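A quick way to check the assistant's claims yourself is a per-column overview. This is a minimal sketch using standard pandas calls; the helper `column_overview` is a hypothetical name, and the toy rows are invented for illustration and are not taken from field_samples_raw.csv.

```python
import pandas as pd


def column_overview(frame):
    """Summarise dtype, missing values, and distinct values per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Made-up rows for illustration only
toy = pd.DataFrame({
    "sex": ["F", "f", None, "M"],
    "body_mass_g": [12.5, None, 13.1, 12.5],
})

print(column_overview(toy))
# Listing distinct values makes inconsistent spellings easy to spot
print(toy["sex"].value_counts(dropna=False))
```

In the notebook you would call `column_overview(df)` and then look at `value_counts` for the columns the assistant flagged.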
What’s next
We continue with 2_data-validation.ipynb, where we turn the data dictionary into an enforceable schema and let it tell us exactly which records break the rules.