Now that we are familiar with the basic usage of our AI assistant, we want to use it for more advanced tasks. These include:
Help with planning and drafting, e.g. data exploration
Reading, exploring and analysing data from files
Creating Jupyter Notebooks or Markdown files
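The initialization that follows reads its settings from environment variables, so it can help to confirm beforehand that they are actually set. The snippet is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper, and the two variable names are the ones used in this tutorial. The name of the API key variable is not stated here, so it is left out.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variable names taken from the initialization code in this tutorial
required = ["ENDPOINT_URL", "SYSTEM_PROMPT_DATA_STEWARD"]

missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If a variable is reported as missing, `os.getenv` returns `None` for it, and the assistant would be initialized without that setting.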
# Import os to read env vars
import os
# Import bob and helper functions
from bia_bob import bob, available_models, fix, doc
# Initialize assistant, reads API key from env var automatically
bob.initialize(
    # read endpoint URL from env var
    endpoint=os.getenv('ENDPOINT_URL'),
    # select an available model for your purpose
    model="mini",
    # read system prompt from env var
    system_prompt=os.getenv('SYSTEM_PROMPT_DATA_STEWARD'),
)
Drafting Plans
%%bob
I have data in a CSV file. I want to explore and validate the data, with the final goal of
including the gained information in a DMP I need to write for the corresponding data science project.
What steps and tasks do I need to consider?
Help me plan the data analysis.
Just a short text explaining the rough plan for now. No code!
Store this text in a Markdown file "my-plan.md".
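Once the assistant has answered, it is worth checking that the plan was really written to disk. The following is a small sketch, assuming the file ended up in the current working directory under the name given in the prompt; `preview_text_file` is a hypothetical helper, not part of bia_bob.

```python
from pathlib import Path


def preview_text_file(path, max_lines=10):
    """Return the first max_lines lines of a text file, or None if it does not exist."""
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        # zip stops after max_lines, so long files are not read completely
        return [line.rstrip("\n") for _, line in zip(range(max_lines), handle)]


preview = preview_text_file("my-plan.md")  # file name from the prompt above
if preview is None:
    print("my-plan.md not found in the current working directory.")
else:
    print("\n".join(preview))
```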
Generating Code
We can now use our AI assistant to tackle the individual steps, e.g., basic data exploration and profiling. Let’s ask for the corresponding Python code:
%%bob
Generate Python code in a code cell below to do the following:
* Read the file in "../data/penguins_raw.csv"
* Explore basic data properties and statistics, include missing values and outliers
* Keep the code as short as possible
* Use short and precise comments
* Do not assume any values or column names but read and use the ones provided with the data
# ------------------------------------------------------------
# Basic exploration of ../data/penguins_raw.csv
# ------------------------------------------------------------
import pandas as pd # data handling
from pathlib import Path # portable file paths
# 1️⃣ Load the CSV (pandas defaults: comma delimiter, UTF-8 encoding)
data_path = Path("..") / "data" / "penguins_raw.csv"
df = pd.read_csv(data_path)
# ------------------------------------------------------------------
# 2️⃣ Core data overview (rows, columns, types, preview)
# ------------------------------------------------------------------
print("Shape (rows, cols):", df.shape)
print("\nColumn types:")
print(df.dtypes)
print("\nFirst 5 rows:")
display(df.head())
# ------------------------------------------------------------------
# 3️⃣ Summary statistics for *numeric* columns
# ------------------------------------------------------------------
print("\nNumeric summary:")
display(df.describe(include='number').transpose())
# ------------------------------------------------------------------
# 4️⃣ Missing‑value report (count per column)
# ------------------------------------------------------------------
missing = df.isnull().sum()
print("\nMissing values per column:")
print(missing[missing > 0])
# ------------------------------------------------------------------
# 5️⃣ Simple outlier detection (IQR method) for numeric columns
# ------------------------------------------------------------------
def iqr_outliers(series):
    """Return boolean mask of outlier rows for a numeric Series."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (series < lower) | (series > upper)

outlier_counts = {}
for col in df.select_dtypes(include='number').columns:
    mask = iqr_outliers(df[col].dropna())
    outlier_counts[col] = mask.sum()

print("\nPotential outliers (IQR) per numeric column:")
print({k: v for k, v in outlier_counts.items() if v > 0})
Shape (rows, cols): (344, 17)
Column types:
studyName str
Sample Number int64
Species str
Region str
Island str
Stage str
Individual ID str
Clutch Completion str
Date Egg str
Culmen Length (mm) float64
Culmen Depth (mm) float64
Flipper Length (mm) float64
Body Mass (g) float64
Sex str
Delta 15 N (o/oo) float64
Delta 13 C (o/oo) float64
Comments str
dtype: object
First 5 rows:
[table output not shown]
Numeric summary:
[table output not shown]
Missing values per column:
Culmen Length (mm) 2
Culmen Depth (mm) 2
Flipper Length (mm) 2
Body Mass (g) 2
Sex 11
Delta 15 N (o/oo) 14
Delta 13 C (o/oo) 13
Comments 290
dtype: int64
Potential outliers (IQR) per numeric column:
{}
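For a DMP, relative numbers are often easier to report than raw counts. As a rough sketch, the missing-value counts printed above can be turned into percentages of the 344 rows. The counts and the row total are copied by hand from that output; `missing_share` is a hypothetical helper introduced only for this illustration.

```python
# Row total and missing-value counts copied from the cell output above
TOTAL_ROWS = 344
missing_counts = {
    "Culmen Length (mm)": 2,
    "Culmen Depth (mm)": 2,
    "Flipper Length (mm)": 2,
    "Body Mass (g)": 2,
    "Sex": 11,
    "Delta 15 N (o/oo)": 14,
    "Delta 13 C (o/oo)": 13,
    "Comments": 290,
}


def missing_share(counts, total_rows):
    """Return {column: percent missing}, rounded to one decimal, largest first."""
    shares = {col: round(100 * n / total_rows, 1) for col, n in counts.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


for column, percent in missing_share(missing_counts, TOTAL_ROWS).items():
    print(f"{column}: {percent} %")
```

In a live session the same numbers can be derived directly from the DataFrame, e.g. with `df.isnull().mean() * 100`, which avoids copying values by hand.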
Generating Notebooks
Next, we want a more detailed data exploration and profiling in a dedicated Jupyter notebook, including sufficient documentation and explanation.
For this, we can re-initialize our AI assistant with a model more specialized for coding.
Note: larger, more sophisticated models may take longer to generate an answer.
# Initialize assistant, reads API key from env var automatically
bob.initialize(
    # read endpoint URL from env var
    endpoint=os.getenv('ENDPOINT_URL'),
    # select an available model for your purpose
    model="coder",
    # read system prompt from env var
    system_prompt=os.getenv('SYSTEM_PROMPT_DATA_STEWARD'),
)
%%bob
Generate a Jupyter notebook called "penguin_data_exploration", which does the following:
* Read the file in "../data/penguins_raw.csv"
* Explore the data properties and statistics, include missing values and outliers
* Visualize basic data distributions
* Do not assume any values or column names but read and use the ones provided with the data
* Structure the notebook so that it is clear and self-explanatory
* Provide Markdown cells for documentation which explain each step and why it is necessary
* Provide a summary of the exploration and its findings at the end of the notebook - the summary shall be included in a DMP
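Before opening the generated notebook, a quick structural check can show whether it contains both documentation and code. Notebook files are plain JSON, so the standard library is enough. This is only a sketch: the file name `penguin_data_exploration.ipynb` and its location in the current working directory are assumptions based on the prompt, and `count_cell_types` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def count_cell_types(notebook):
    """Count cells per type in a notebook dict that follows the nbformat v4 layout."""
    cells = notebook.get("cells", [])
    return Counter(cell.get("cell_type", "unknown") for cell in cells)


# Assumed file name and location; adjust if the assistant saved it elsewhere
nb_path = Path("penguin_data_exploration.ipynb")

if nb_path.is_file():
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    print(dict(count_cell_types(notebook)))
else:
    print(f"{nb_path} not found; check where the assistant saved it.")
```

A notebook with no Markdown cells at all would suggest that the documentation requirement in the prompt was not met, and the prompt may need to be repeated or refined.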