Session 2.4: Metadata & FAIR packaging

DSC ScaDS.AI, Leipzig University

Stage of the pipeline:

EXPLORE → VALIDATE → REPORT & CLEAN → PACKAGE → DMP & DEPOSIT

We now have cleaned, validated data and a record of what was wrong. The last step before a DMP or a deposit is to make the data self-describing: bundle it with metadata so that someone who has never met us can understand and reuse it. That is the core of the FAIR principles (Findable, Accessible, Interoperable, Reusable).

We use frictionless, which builds a Data Package - a small, standard, portable container described by a single datapackage.json file. Research repositories (including Zenodo) understand this format.

After this notebook you will be able to: infer a table schema from data, enrich it with human-readable metadata, bundle everything into a validated Data Package, and explain how this “structural” validation differs from the “content” validation we did with pandera.

But first, initialize our AI assistant:

# Import os to read env vars
import os
# Import bob and helper functions
from bia_bob import bob, available_models, fix, doc

# Initialize assistant, reads API key from env var automatically
bob.initialize(
    # read endpoint URL from env var
    endpoint=os.getenv('ENDPOINT_URL'),
    # select an available model for your purpose
    model="coder",
    # read the Data Steward system prompt from env var
    system_prompt=os.getenv('SYSTEM_PROMPT_DATA_STEWARD'),
)

If frictionless is not already available in your venv, you can use uv to install (add) it:

#!uv add frictionless

Import the objects we need from frictionless:

from frictionless import Package, Resource, Schema, describe

Let frictionless describe the data for us

describe() reads the file, infers each column’s type, and produces a Table Schema - all automatically. This is our starting point and we will refine it.

resource = describe("../outputs/field_samples_clean.csv")

# Show the field types frictionless inferred
for field in resource.schema.fields:
    print(f"{field.name:20s} -> {field.type}")
sample_id            -> string
species              -> string
island               -> string
date_collected       -> date
body_mass_g          -> integer
flipper_length_mm    -> number
sex                  -> string
region               -> string
notes                -> string

Notice that frictionless correctly inferred date_collected as a date, body_mass_g as an integer, and flipper_length_mm as a number. This works because we cleaned the data first; on the raw data these columns would all have been inferred as plain strings. Clean data describes itself better.
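To see why cleaning helps inference, here is a minimal standard-library sketch of column type inference. It is not frictionless's actual algorithm; the helpers `_parses` and `infer_type` are hypothetical, and the sample values are made up for illustration.

```python
from datetime import date


def _parses(caster, value):
    """Return True if caster(value) succeeds without a ValueError."""
    try:
        caster(value)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Guess the narrowest type that fits every non-empty value in a column."""
    present = [v for v in values if v != ""]
    if not present:
        return "string"
    if all(_parses(int, v) for v in present):
        return "integer"
    if all(_parses(float, v) for v in present):
        return "number"
    if all(_parses(date.fromisoformat, v) for v in present):
        return "date"
    return "string"


# Made-up sample values, for illustration only
clean_mass = ["3750", "3800", ""]
raw_mass = ["3750", "3800 g", "n/a"]  # a stray unit and a placeholder
clean_dates = ["2026-01-15", "2026-01-16"]

print("clean mass ->", infer_type(clean_mass))
print("raw mass   ->", infer_type(raw_mass))
print("dates      ->", infer_type(clean_dates))
```

A single value that does not parse is enough to push the whole column down to string, which is why one stray unit or placeholder in the raw file matters so much.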

Enrich the schema with human-readable metadata

Inferred types are not enough for reuse - a future user needs to know what each field means, its units, and its allowed values. These are exactly the descriptions we already wrote in our pandera schema (Notebook 2) and the data dictionary. Let’s reuse them.

We can ask the assistant to help draft any we are missing:

%%bob
I am writing field descriptions for a frictionless Table Schema describing 
penguin field samples. The columns are: sample_id, species, island, date_collected, body_mass_g,
flipper_length_mm, sex, region, notes. 
Suggest a one-line description for each, noting units where relevant. 
Return them as a Python dict called `descriptions`. Just the code.

Here are the reference descriptions (the same wording as our pandera schema, so the two artifacts stay consistent):

descriptions_reference = {
    "sample_id": "Unique sample identifier, pattern S + 4 digits (e.g. S0001).",
    "species": "Penguin species (controlled vocabulary).",
    "island": "Island in the Palmer Archipelago (controlled vocabulary).",
    "date_collected": "Date the sample was collected, ISO 8601 (YYYY-MM-DD).",
    "body_mass_g": "Body mass in grams.",
    "flipper_length_mm": "Flipper length in millimetres.",
    "sex": "Recorded sex; empty means unknown/not recorded.",
    "region": "Broad geographic region (constant across the dataset).",
    "notes": "Optional free-text field-collection remarks.",
}

# attach a description to every field
for field in resource.schema.fields:
    field.description = descriptions_reference.get(field.name, "")

# we can also bake a controlled vocabulary into the schema as a constraint
for field in resource.schema.fields:
    if field.name == "species":
        field.constraints["enum"] = ["Adelie", "Chinstrap", "Gentoo"]

print("Field metadata added.")
Field metadata added.
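Because `descriptions_reference.get(field.name, "")` falls back to an empty string, a misspelled key would leave a field undocumented without raising any error. The following is a small sketch of a coverage check in plain Python, so it runs without frictionless; `check_description_coverage` is a hypothetical helper, and the `columns` and `drafts` values are illustrative only.

```python
def check_description_coverage(field_names, descriptions):
    """Compare schema field names against the keys of a descriptions dict."""
    fields = set(field_names)
    keys = set(descriptions)
    return {
        # fields that would end up with an empty description
        "undocumented": sorted(fields - keys),
        # description keys that match no field (often a typo)
        "unused": sorted(keys - fields),
        # keys that exist but contain only whitespace
        "blank": sorted(k for k in fields & keys if not descriptions[k].strip()),
    }


# Illustrative inputs with one typo and one blank entry
columns = ["sample_id", "species", "body_mass_g"]
drafts = {
    "sample_id": "Unique sample identifier.",
    "species": "   ",
    "body_mass": "Body mass in grams.",  # typo: should be body_mass_g
}

result = check_description_coverage(columns, drafts)
print(result)
```

In this notebook the same helper could be called with `[f.name for f in resource.schema.fields]` and `descriptions_reference`; all three lists should come back empty.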

Add dataset-level metadata and build the package

Field-level metadata describes the columns; package-level metadata describes the dataset as a whole: title, description, licence, contributors, keywords.

This is what makes a dataset Findable and Reusable. Ask the assistant for help with a description and keywords if you like, then assemble the Package:

# make the package self-contained: path relative to the package folder
resource.path = "field_samples_clean.csv"

package = Package(
    name="field-samples-2026",
    title="Penguin Field Samples 2026",
    description=(
        "Cleaned and validated biological field-sample measurements "
        "(species, island, date, body mass, flipper length, sex)."
    ),
    resources=[resource],
    basepath="../outputs",
    licenses=[{"name": "CC-BY-4.0", "title": "Creative Commons Attribution 4.0"}],
    contributors=[{"title": "Data Steward", "role": "author"}],
    keywords=["penguins", "biology", "field samples", "FAIR"],
)

# Save the descriptor - this single file makes the data self-describing
from pathlib import Path
Path("../outputs").mkdir(exist_ok=True)
package.to_json("../outputs/datapackage.json")
print("Wrote ../outputs/datapackage.json")
Wrote ../outputs/datapackage.json

Let’s look at what we produced. A datapackage.json is just readable JSON - you can open it in JupyterLab, or print the field section here:

import json

descriptor = json.loads(Path("../outputs/datapackage.json").read_text())
print("Package keys:", list(descriptor.keys()))
print("\nFirst two described fields:")
for field in descriptor["resources"][0]["schema"]["fields"][:2]:
    print(json.dumps(field, indent=2))
Package keys: ['name', 'title', 'description', 'licenses', 'contributors', 'keywords', 'resources']

First two described fields:
{
  "name": "sample_id",
  "type": "string",
  "description": "Unique sample identifier, pattern S + 4 digits (e.g. S0001)."
}
{
  "name": "species",
  "type": "string",
  "description": "Penguin species (controlled vocabulary).",
  "constraints": {
    "enum": [
      "Adelie",
      "Chinstrap",
      "Gentoo"
    ]
  }
}
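Since the descriptor is plain JSON, package-level metadata can be checked with ordinary dictionary operations. The sketch below assumes the set of expected keys simply mirrors the ones used in this notebook; which keys a project or repository actually requires is a local decision. `missing_package_metadata` and the `example` descriptor are hypothetical.

```python
# Assumption: these are the keys this notebook sets, not a list mandated by a standard
EXPECTED_KEYS = (
    "name",
    "title",
    "description",
    "licenses",
    "contributors",
    "keywords",
    "resources",
)


def missing_package_metadata(descriptor, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty in a package descriptor."""
    return [key for key in expected if not descriptor.get(key)]


# Illustrative descriptor with an empty description and two keys left out
example = {
    "name": "field-samples-2026",
    "title": "Penguin Field Samples 2026",
    "description": "",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [{"path": "field_samples_clean.csv"}],
}

print(missing_package_metadata(example))
```

Passing the `descriptor` loaded above from datapackage.json should return an empty list.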

Validate the package

Finally, frictionless checks that the data matches the schema we described and that the package is well-formed:

report = package.validate()

print("Package valid:", report.valid)
if not report.valid:
    # show any problems in a readable table
    print(report.flatten(["rowNumber", "fieldName", "type", "note"]))
Package valid: True

Two kinds of validation (and why we did both)

It can seem like we validated twice. We did, but for different purposes:

| | pandera (Notebooks 2–3) | frictionless (here) |
| --- | --- | --- |
| Question it answers | Are the values correct? (ranges, required, vocabularies) | Does the file match its described schema, and is the package well-formed and portable? |
| Strength | rich, in-pipeline checks on a DataFrame | a portable metadata + packaging standard that repositories understand |
| Output | a list of bad records to fix or flag | a self-describing, deposit-ready package |

They are complementary. frictionless can also enforce value constraints (enum, minimum, maximum, required) if you declare them - we added an enum for species above - so you can choose how much of the contract to carry inside the package itself.
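To make "value constraints" concrete, here is a rough standard-library sketch of what checking `required`, `enum`, `minimum` and `maximum` against rows involves. It is not how frictionless implements validation; `check_constraints`, the row numbering choice, and the sample rows are all hypothetical.

```python
def check_constraints(rows, constraints):
    """Return a list of (row_number, field, problem) tuples.

    rows: list of dicts with already-typed values
    constraints: {field: {"required" / "enum" / "minimum" / "maximum": ...}}
    """
    problems = []
    # Assumption: count the header as row 1, so data starts at row 2
    for number, row in enumerate(rows, start=2):
        for field, rules in constraints.items():
            value = row.get(field)
            if value is None or value == "":
                if rules.get("required"):
                    problems.append((number, field, "missing required value"))
                continue  # nothing else to check on an empty cell
            if "enum" in rules and value not in rules["enum"]:
                problems.append((number, field, f"{value!r} not in enum"))
            if "minimum" in rules and value < rules["minimum"]:
                problems.append((number, field, f"{value} below minimum {rules['minimum']}"))
            if "maximum" in rules and value > rules["maximum"]:
                problems.append((number, field, f"{value} above maximum {rules['maximum']}"))
    return problems


# Made-up rows, for illustration only
rows = [
    {"species": "Adelie", "body_mass_g": 3750},
    {"species": "Adelei", "body_mass_g": 3800},  # misspelled species
    {"species": "Gentoo", "body_mass_g": -5},    # impossible mass
    {"species": "", "body_mass_g": 4100},        # missing species
]
constraints = {
    "species": {"required": True, "enum": ["Adelie", "Chinstrap", "Gentoo"]},
    "body_mass_g": {"minimum": 0},
}

problems = check_constraints(rows, constraints)
for problem in problems:
    print(problem)
```

Declaring the same rules in the Table Schema means they travel with the data, so a later user can re-run the checks without access to the original pipeline.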

Remember the flagged records from Notebook 3. Structural validity here does not mean the data is perfect - the missing and out-of-range values we flagged are still worth documenting. A good package records known limitations (e.g. in the package description or an accompanying README), rather than hiding them.
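One way to record known limitations is to append them to the package description before saving the descriptor. This is only a sketch: the "Known limitations" heading is a convention chosen here, `add_known_limitations` is a hypothetical helper, and the limitation texts are placeholders to be replaced with the findings from your own cleaning report.

```python
import copy
import json


def add_known_limitations(descriptor, limitations):
    """Return a copy of the descriptor with limitations appended to its description."""
    updated = copy.deepcopy(descriptor)  # leave the original dict untouched
    bullet_list = "\n".join(f"- {item}" for item in limitations)
    base = updated.get("description", "").rstrip()
    updated["description"] = f"{base}\n\nKnown limitations:\n{bullet_list}".lstrip()
    return updated


descriptor = {
    "name": "field-samples-2026",
    "description": "Cleaned and validated biological field-sample measurements.",
}
# Placeholder wording, not actual findings
limitations = [
    "Some records have no recorded sex.",
    "Out-of-range measurements were flagged rather than removed.",
]

updated = add_known_limitations(descriptor, limitations)
print(json.dumps(updated, indent=2))
```

A longer account of the flagged records usually fits better in a README next to datapackage.json, with the description carrying only a short summary.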

What we have now

A single folder containing the cleaned data plus a datapackage.json that describes it in a standard, machine-readable, FAIR-aligned way - portable and ready to hand over or deposit.
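Portability depends on every resource path being relative and pointing at a file inside the package folder. The sketch below checks that with the standard library only. It assumes each resource has a single local string `path` (the standard also allows other forms, which are ignored here); `check_portable` is a hypothetical helper and the demo folder is a throwaway temporary directory.

```python
import json
import tempfile
from pathlib import Path, PurePosixPath


def check_portable(package_dir):
    """Return a list of problems that would make the package folder non-portable."""
    package_dir = Path(package_dir)
    descriptor = json.loads((package_dir / "datapackage.json").read_text(encoding="utf-8"))
    problems = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path", "")
        posix = PurePosixPath(path)
        if posix.is_absolute() or ".." in posix.parts:
            problems.append(f"{path}: path should be relative and inside the package folder")
        elif not (package_dir / path).is_file():
            problems.append(f"{path}: file not found next to datapackage.json")
    return problems


# Demo in a temporary folder: one good resource, two problematic ones
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "field_samples_clean.csv").write_text(
        "sample_id,species\nS0001,Adelie\n", encoding="utf-8"
    )
    demo_descriptor = {
        "resources": [
            {"path": "field_samples_clean.csv"},
            {"path": "../raw/field_samples.csv"},
            {"path": "missing.csv"},
        ]
    }
    (folder / "datapackage.json").write_text(json.dumps(demo_descriptor), encoding="utf-8")
    problems = check_portable(folder)

for problem in problems:
    print(problem)
```

Running `check_portable("../outputs")` on the folder built in this notebook should return an empty list, because `resource.path` was set to a bare file name before the package was saved.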

We will continue with 5_dmp-and-deposit.ipynb to draft a Data Management Plan section from everything we have learned, and optionally deposit the package to a repository.