Session 1.4: AI Assistant for Data Steward Tasks

DSC ScaDS.AI, Leipzig University

Now that we are familiar with the basic usage of our AI assistant, we want to use it for more advanced tasks. These include:

  • Help with planning and drafting, e.g. data exploration

  • Reading, exploring and analysing data from files

  • Creating Jupyter Notebooks or Markdown files

# Import os to read env vars
import os
# Import bob and helper functions
from bia_bob import bob, available_models, fix, doc
# Initialize assistant, reads API key from env var automatically
bob.initialize(
    # read endpoint URL from env var
    endpoint=os.getenv('ENDPOINT_URL'),
    # select an available model for your purpose
    model="mini",
    # read system prompt from env var
    system_prompt=os.getenv('SYSTEM_PROMPT_DATA_STEWARD'),
)
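Because the initialization cell reads its settings from environment variables, it can help to confirm they are set before calling `bob.initialize`. The following is a minimal sketch, not part of `bia_bob`: the helper `missing_env_vars` is hypothetical, and it only checks the two variable names used above. The name of the API key variable is not given here, so it is not checked.

```python
import os

# Variable names taken from the initialization cell above.
REQUIRED_VARS = ("ENDPOINT_URL", "SYSTEM_PROMPT_DATA_STEWARD")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If a variable is reported as missing, `os.getenv` returns `None` for it in the initialization call.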

Drafting Plans

%%bob
I have data in a CSV file. I want to explore and validate the data, with the final goal of
including the gained information in a DMP that I need to write for the corresponding data science project.
Which steps and tasks do I need to consider?
Help me plan the data analysis.
For now, just a short text explaining the coarse plan. No code!
Store this text in a Markdown file "my-plan.md".
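Since the prompt asks the assistant to store the plan in "my-plan.md", a quick check that the file exists and a look at its first lines can be useful. This is a minimal sketch using only the standard library. The helper `preview_text_file` is hypothetical, and the sketch assumes the file was written to the current working directory.

```python
from pathlib import Path


def preview_text_file(path, max_lines=10):
    """Return up to max_lines lines of a text file, or [] if it does not exist."""
    file = Path(path)
    if not file.is_file():
        return []
    return file.read_text(encoding="utf-8").splitlines()[:max_lines]


# Assumption: the assistant wrote the file next to this notebook.
lines = preview_text_file("my-plan.md")
if lines:
    print("\n".join(lines))
else:
    print("my-plan.md not found; check the assistant's response.")
```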

Generating Code

We can now use our AI assistant to tackle the individual steps, e.g., basic data exploration and profiling. Let’s ask for the corresponding Python code:

%%bob
Generate Python code in a code cell below to do the following:
* Read the file "../data/penguins_raw.csv"
* Explore basic data properties and statistics, include missing values and outliers
* Keep the code as short as possible
* Use short and precise comments
* Do not assume any values or column names but read and use the ones provided with the data
# ------------------------------------------------------------
# Basic exploration of ../data/penguins_raw.csv
# ------------------------------------------------------------
import pandas as pd               # data handling
from pathlib import Path         # portable file paths

# 1️⃣ Load the CSV (pandas defaults: comma delimiter, UTF-8 encoding)
data_path = Path("..") / "data" / "penguins_raw.csv"
df = pd.read_csv(data_path)

# ------------------------------------------------------------------
# 2️⃣ Core data overview (rows, columns, types, preview)
# ------------------------------------------------------------------
print("Shape (rows, cols):", df.shape)
print("\nColumn types:")
print(df.dtypes)
print("\nFirst 5 rows:")
display(df.head())

# ------------------------------------------------------------------
# 3️⃣ Summary statistics for *numeric* columns
# ------------------------------------------------------------------
print("\nNumeric summary:")
display(df.describe(include='number').transpose())

# ------------------------------------------------------------------
# 4️⃣ Missing‑value report (count per column)
# ------------------------------------------------------------------
missing = df.isnull().sum()
print("\nMissing values per column:")
print(missing[missing > 0])

# ------------------------------------------------------------------
# 5️⃣ Simple outlier detection (IQR method) for numeric columns
# ------------------------------------------------------------------
def iqr_outliers(series):
    """Return boolean mask of outlier rows for a numeric Series."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (series < lower) | (series > upper)

outlier_counts = {}
for col in df.select_dtypes(include='number').columns:
    mask = iqr_outliers(df[col].dropna())
    outlier_counts[col] = mask.sum()

print("\nPotential outliers (IQR) per numeric column:")
print({k: v for k, v in outlier_counts.items() if v > 0})
Shape (rows, cols): (344, 17)

Column types:
studyName                  str
Sample Number            int64
Species                    str
Region                     str
Island                     str
Stage                      str
Individual ID              str
Clutch Completion          str
Date Egg                   str
Culmen Length (mm)     float64
Culmen Depth (mm)      float64
Flipper Length (mm)    float64
Body Mass (g)          float64
Sex                        str
Delta 15 N (o/oo)      float64
Delta 13 C (o/oo)      float64
Comments                   str
dtype: object

First 5 rows:
(table output not rendered)

Numeric summary:
(table output not rendered)

Missing values per column:
Culmen Length (mm)       2
Culmen Depth (mm)        2
Flipper Length (mm)      2
Body Mass (g)            2
Sex                     11
Delta 15 N (o/oo)       14
Delta 13 C (o/oo)       13
Comments               290
dtype: int64

Potential outliers (IQR) per numeric column:
{}
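The empty dictionary means that no numeric value fell outside the 1.5 × IQR fences. It does not by itself rule out implausible values. To see how the rule behaves when an outlier is present, here is a small sketch that reuses the `iqr_outliers` function from the generated code on made-up numbers (the values in `toy` are illustrative and not taken from the penguin data).

```python
import pandas as pd


def iqr_outliers(series):
    """Return a boolean mask marking values outside the 1.5 * IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (series < lower) | (series > upper)


# Made-up values for illustration only.
toy = pd.Series([10, 11, 11, 12, 12, 13, 100])
mask = iqr_outliers(toy)

# Q1 = 11, Q3 = 12.5, IQR = 1.5, so the fences are 8.75 and 14.75.
print(toy[mask].tolist())  # prints [100]
```

The fences depend only on the quartiles, so a column with a wide but smooth spread can legitimately report no outliers.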

Generating Notebooks

Now we want a more detailed data exploration and profiling in a dedicated Jupyter notebook, including sufficient documentation and explanation.

For this, we can re-initialize our AI assistant with a model more specialized for coding.
Note: larger, more sophisticated models may take longer to generate an answer.

# Initialize assistant, reads API key from env var automatically
bob.initialize(
    # read endpoint URL from env var
    endpoint=os.getenv('ENDPOINT_URL'),
    # select an available model for your purpose
    model="coder",
    # read system prompt from env var
    system_prompt=os.getenv('SYSTEM_PROMPT_DATA_STEWARD'),
)
%%bob
Generate a Jupyter notebook called "penguin_data_exploration", which does the following:
* Read the file "../data/penguins_raw.csv"
* Explore the data properties and statistics, include missing values and outliers
* Visualize basic data distributions
* Do not assume any values or column names but read and use the ones provided with the data
* Structure the notebook so that it is clear and self-explanatory
* Provide Markdown cells for documentation which explain each step and why it is necessary
* Provide a summary of the exploration and its findings at the end of the notebook; the summary will be included in a DMP
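Once the assistant has responded, it may be worth checking that the notebook was actually created and contains both code and Markdown cells, as requested in the prompt. The sketch below relies only on the fact that `.ipynb` files are JSON documents with a top-level `cells` list. The helper `count_cell_types` is hypothetical, and the file name and location are assumptions derived from the notebook name in the prompt.

```python
import json
from collections import Counter
from pathlib import Path


def count_cell_types(notebook_path):
    """Count cells by type in a .ipynb file (hypothetical helper)."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    return Counter(cell["cell_type"] for cell in notebook.get("cells", []))


# Assumption: the notebook was saved in the current working directory.
path = Path("penguin_data_exploration.ipynb")
if path.is_file():
    print(dict(count_cell_types(path)))
else:
    print(f"{path} not found; check the assistant's response.")
```

A notebook with few or no Markdown cells would suggest that the documentation requirement in the prompt was not followed and the request should be repeated or refined.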