Session 2.5: Data Management Plan & deposit

DSC ScaDS.AI, Leipzig University

Stage of the pipeline:

EXPLORE → VALIDATE → REPORT & CLEAN → PACKAGE → DMP & DEPOSIT

This optional notebook closes the loop. We have a validated, documented, packaged dataset. Two things remain that a data steward typically does:

  • Part A - draft a Data Management Plan (DMP) section from everything we learned, with the AI assistant doing the heavy lifting and us reviewing it.

  • Part B - deposit the package to a repository so it gets a persistent identifier (DOI). We use the Zenodo Sandbox, a safe test environment, so nothing goes public.

Part B is designed to work either as a hands-on task (with your own sandbox token) or as a live demo by the instructor. If you have no token, the deposit cells simply skip themselves and you can still follow along.

What we have to work with

From the previous notebooks:

  • ../outputs/field_samples_clean.csv - cleaned, validated data

  • ../outputs/data_schema.yml - the data contract (pandera)

  • ../outputs/datapackage.json - FAIR metadata + data dictionary (frictionless)

  • ../outputs/quality_report_clean.html - shareable quality report

  • a record of flagged issues (missing required values, impossible measurements)

A DMP is where all of this comes together in human-readable form.
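Before drafting, it can help to confirm that the earlier notebooks really left these files behind. The cell below is a minimal sketch: the helper name `missing_outputs` is hypothetical, and it assumes the files live in `../outputs` under the names listed above.

```python
from pathlib import Path

# File names taken from the list above; adjust if your outputs differ.
EXPECTED_OUTPUTS = [
    "field_samples_clean.csv",
    "data_schema.yml",
    "datapackage.json",
    "quality_report_clean.html",
]

def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are absent or empty in output_dir."""
    output_dir = Path(output_dir)
    missing = []
    for name in expected:
        path = output_dir / name
        # A zero-byte file usually means an earlier step failed half-way.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing

missing = missing_outputs("../outputs")
if missing:
    print("Missing or empty - re-run the earlier notebooks:", ", ".join(missing))
else:
    print("All expected outputs are present.")
```

The check only looks at presence and size; it does not validate the content of any file.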

But first, initialize our AI assistant:

# Import os to read env vars
import os
# Import bob and helper functions
from bia_bob import bob, available_models, fix, doc

# Initialize assistant, reads API key from env var automatically
bob.initialize(
    # read endpoint URL from env var
    endpoint=os.getenv('ENDPOINT_URL'),
    # select an available model for your purpose
    model="mini",
    # read the Data Steward system prompt from env var
    system_prompt=os.getenv('SYSTEM_PROMPT_DATA_STEWARD'),
)

Part A - Draft a Data Management Plan

A DMP describes how data is handled during and after a project: what the data is, how it is made FAIR, and how it is stored, licensed, and preserved. Many funders (e.g. Horizon Europe) require one.

It is a document the steward owns and is accountable for - but the assistant is excellent at turning our scattered findings into a structured first draft. This is exactly the “AI-first, then verify” pattern: the assistant drafts, you check every claim against what you actually did.

Give the assistant the facts, ask for a DMP draft

We feed in the concrete things we did and learned. Notice we are not asking it to invent anything - we give it the real findings and ask it to structure them.
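One way to keep the facts honest is to read the countable ones straight from the cleaned file instead of typing them from memory. The sketch below uses only the standard library; the helper names `csv_facts` and `as_prompt_lines` are hypothetical, and it assumes the CSV has a single header row.

```python
import csv
import os

def csv_facts(path):
    """Return the column names and record count of a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as fp:
        reader = csv.reader(fp)
        columns = next(reader, [])
        # Blank lines come back as empty lists and are not counted as records.
        n_records = sum(1 for row in reader if row)
    return {"columns": columns, "n_records": n_records}

def as_prompt_lines(facts):
    """Format the facts as bullet lines that can be pasted into a prompt."""
    return [
        f"- Columns: {', '.join(facts['columns'])}",
        f"- Records: {facts['n_records']}",
    ]

clean_csv = "../outputs/field_samples_clean.csv"
if os.path.exists(clean_csv):
    print("\n".join(as_prompt_lines(csv_facts(clean_csv))))
```

The printed lines can be copied into the facts list of the prompt, so the record count and column names in the DMP come from the data rather than from recollection.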

%%bob
Draft a Data Management Plan section for a small research dataset, following the
Horizon Europe DMP structure (1. Data summary; 2. FAIR data with subsections on findable,
accessible, interoperable, reusable; plus short notes on storage/backup and licensing).
Write it in Markdown and store it to "dmp_draft.md"

Use ONLY these facts and do not invent details:
- Dataset: biological field samples of penguins (species, island, collection date, body mass,
  flipper length, sex), ~180 records, CSV format.
- Provenance: raw CSV handed over from a research project; cleaned and validated by the data steward.
- Validation: a pandera schema enforces types, controlled vocabularies (species, island, sex),
  value ranges, and a unique sample_id. Stored as data_schema.yml.
- Quality: a profiling report was produced. Known remaining issues are documented and flagged
  back to the data provider: some missing required values and a few impossible/out-of-range
  measurements (NOT silently changed).
- Packaging: a Frictionless Data Package (datapackage.json) carries a data dictionary and metadata.
- Findability/PID: the package will be deposited to a repository (Zenodo) which mints a DOI.
- Interoperability: open CSV format, Frictionless Table Schema, controlled vocabularies.
- Licensing: CC-BY-4.0.
- Storage: kept with version control; repository handles long-term preservation.

The assistant will produce a structured draft similar to the reference below. Read it critically - does every statement match what we actually did? A DMP with claims you cannot back up is worse than no DMP.

Example DMP draft:
# Data Management Plan - Penguin Field Samples 2026

## 1. Data summary
- **Purpose:** Biological field-sample measurements supporting research on penguin populations.
- **Types & formats:** One tabular dataset (~180 records) in open CSV format.
- **Variables:** sample_id, species, island, date_collected, body_mass_g, flipper_length_mm, sex, region, notes.
- **Origin:** Raw CSV handed over from a research project; cleaned and validated by the data steward.
- **Expected size:** < 1 MB.
- **Utility:** Researchers in ecology/biology; reusable as an example of curated field data.

## 2. FAIR data
### 2.1 Findable
- The dataset will be deposited in Zenodo, which mints a persistent **DOI**.
- Rich metadata (title, description, keywords, contributors) is recorded in `datapackage.json`.

### 2.2 Accessible
- Deposited openly; metadata and data retrievable via the repository over standard HTTP(S).

### 2.3 Interoperable
- Open CSV format with a **Frictionless Table Schema** describing every field.
- **Controlled vocabularies** for species, island, and sex ensure consistent interpretation.

### 2.4 Reusable
- Released under **CC-BY-4.0**.
- A **pandera** schema documents the data contract (types, ranges, vocabularies, unique IDs).
- Data quality was assessed; **known issues** (some missing required values, a few impossible
  measurements) are documented and were flagged to the data provider rather than altered.

## Storage & backup
- Working files kept under version control; the repository provides long-term preservation.

## Licensing
- CC-BY-4.0 (attribution required).
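Part of this review can be made mechanical. The example draft lists `region` and `notes` among its variables, although neither appears in the facts we passed in the prompt, so both are worth checking against the real file. The sketch below compares the draft's variable list with a set of known column names. The function names are hypothetical, and it assumes the draft contains a line of the form `**Variables:** a, b, c`, which the assistant is not guaranteed to produce.

```python
import re

def draft_variables(draft_text):
    """Extract the names listed on the first '**Variables:**' line of a Markdown draft."""
    match = re.search(r"\*\*Variables:\*\*\s*(.+)", draft_text)
    if not match:
        return []
    # Drop a trailing full stop, then split the comma-separated list.
    names = match.group(1).rstrip(". \t")
    return [name.strip() for name in names.split(",") if name.strip()]

def unbacked_variables(draft_text, known_columns):
    """Return variables named in the draft that are not among the known columns."""
    known = set(known_columns)
    return [name for name in draft_variables(draft_text) if name not in known]

# Illustrative input only: in practice, read dmp_draft.md and the real CSV header.
example_draft = "- **Variables:** sample_id, species, wing_span.\n- **Origin:** raw CSV"
print(unbacked_variables(example_draft, ["sample_id", "species"]))
```

An empty result does not prove the draft is correct; it only shows that every variable it names exists in the column list you supplied. Every other claim still needs a human read.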

Part B - Deposit to the Zenodo Sandbox (optional)

Depositing the package gives it a home and a DOI (a persistent identifier), making it truly Findable. We use the Zenodo Sandbox - a testing copy of Zenodo. Nothing here is permanent or public: sandbox records use fake DOIs and the sandbox can be wiped at any time.

Preparation (do once, before the session):

  1. Register at https://sandbox.zenodo.org (this is separate from the real zenodo.org).

  2. Create a personal access token: Applications → Personal access tokens → New token, with the deposit:write and deposit:actions scopes.

  3. Add it to your secrets file as ZENODO_SANDBOX_TOKEN=...

  4. Restart JupyterLab with --env-file, exactly like your other secrets.

Never commit a token to git or any other repository.

Read the token (and skip gracefully if it is missing)

import os
import requests

ZENODO_TOKEN = os.getenv("ZENODO_SANDBOX_TOKEN")
# SANDBOX only — not the real Zenodo
BASE = "https://sandbox.zenodo.org/api"

HAVE_TOKEN = bool(ZENODO_TOKEN)
print("Token found - deposit cells will run." if HAVE_TOKEN
      else "No ZENODO_SANDBOX_TOKEN set - deposit cells will skip. Follow the demo instead.")
Token found - deposit cells will run.

Step 1 - create an empty deposition

This reserves a record (and a DOI) on the sandbox. We keep the returned id and the bucket URL (a folder-like target we upload files into).

if HAVE_TOKEN:
    headers = {"Authorization": f"Bearer {ZENODO_TOKEN}"}
    r = requests.post(f"{BASE}/deposit/depositions", json={}, headers=headers)
    r.raise_for_status()
    deposition = r.json()
    deposition_id = deposition["id"]
    bucket_url = deposition["links"]["bucket"]
    draft_doi = deposition["metadata"]["prereserve_doi"]["doi"]
    print("Created deposition:", deposition_id)
    print("Reserved test DOI:", draft_doi)
    print("View draft:", deposition["links"]["html"])
Created deposition: 508226
Reserved test DOI: 10.5281/zenodo.508226
View draft: https://sandbox.zenodo.org/deposit/508226

Step 2 - upload the package files

We upload the two files that make up our self-contained package: the data and its descriptor. The new files API is a simple PUT of each file into the bucket.

if HAVE_TOKEN:
    files_to_upload = [
        "../outputs/field_samples_clean.csv",
        "../outputs/datapackage.json",
    ]
    for path in files_to_upload:
        filename = os.path.basename(path)
        with open(path, "rb") as fp:
            up = requests.put(f"{bucket_url}/{filename}", data=fp, headers=headers)
        up.raise_for_status()
        print("Uploaded:", filename)
Uploaded: field_samples_clean.csv
Uploaded: datapackage.json
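A successful status code tells us the upload was accepted, not that the bytes arrived intact. The sketch below compares a local MD5 digest with a checksum string. It assumes the upload response JSON carries a `checksum` field of the form `md5:<hex>`; verify that against the response you actually get before relying on it. The helper names are hypothetical.

```python
import hashlib

def local_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_matches(path, reported):
    """Compare a local file with a reported checksum string such as 'md5:<hex>'."""
    algorithm, _, value = reported.partition(":")
    if algorithm.lower() != "md5" or not value:
        raise ValueError(f"Unsupported checksum format: {reported!r}")
    return local_md5(path) == value.lower()

# Possible use inside the upload loop (assumes the 'checksum' field exists):
# assert checksum_matches(path, up.json()["checksum"]), f"Checksum mismatch: {filename}"
```

MD5 is used here only because it is the format assumed for the response; it detects accidental corruption and is not a security guarantee.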

Step 3 - add metadata

We describe the deposit (title, type, description, creators, licence, keywords). These mirror the metadata already in our datapackage.json.

import json

if HAVE_TOKEN:
    metadata = {
        "metadata": {
            "title": "Penguin Field Samples 2026 (training deposit)",
            "upload_type": "dataset",
            "description": (
                "Cleaned and validated biological field-sample measurements, packaged as a "
                "Frictionless Data Package. Created during the EOSC SSDS 2026 training. "
                "Test deposit on the Zenodo Sandbox."
            ),
            "creators": [{"name": "Data Steward", "affiliation": "EOSC SSDS 2026"}],
            "access_right": "open",
            "license": "cc-by-4.0",
            "keywords": ["penguins", "biology", "field samples", "FAIR", "training"],
        }
    }
    meta_headers = {"Content-Type": "application/json",
                    "Authorization": f"Bearer {ZENODO_TOKEN}"}
    r = requests.put(f"{BASE}/deposit/depositions/{deposition_id}",
                     data=json.dumps(metadata), headers=meta_headers)
    r.raise_for_status()
    print("Metadata added. Review the draft in your browser before publishing:")
    print(r.json()["links"]["html"])
Metadata added. Review the draft in your browser before publishing:
https://sandbox.zenodo.org/deposit/508226
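Because the deposit metadata mirrors the descriptor, it can also be derived from it rather than typed twice. The sketch below maps a Data Package descriptor, loaded as a dict, to the payload shape used in Step 3. It assumes the descriptor uses the standard `title`, `description`, `keywords`, `licenses` and `contributors` properties; which of these our own `datapackage.json` contains is not shown here, so missing keys fall back to empty values. The function name is hypothetical.

```python
def zenodo_metadata_from_descriptor(descriptor):
    """Map a Data Package descriptor (dict) to a Zenodo deposit metadata payload."""
    licenses = descriptor.get("licenses") or [{}]
    creators = [
        {"name": c.get("title", ""), "affiliation": c.get("organization", "")}
        for c in descriptor.get("contributors", [])
    ]
    return {
        "metadata": {
            "title": descriptor.get("title", ""),
            "upload_type": "dataset",
            "description": descriptor.get("description", ""),
            "creators": creators,
            "access_right": "open",
            # The Step 3 payload uses a lowercase licence id such as "cc-by-4.0".
            "license": licenses[0].get("name", "").lower(),
            "keywords": list(descriptor.get("keywords", [])),
        }
    }

# Illustrative descriptor only, not the content of our datapackage.json.
example = {
    "title": "Example package",
    "licenses": [{"name": "CC-BY-4.0"}],
    "contributors": [{"title": "Data Steward", "organization": "Example Org"}],
    "keywords": ["FAIR"],
}
print(zenodo_metadata_from_descriptor(example)["metadata"]["license"])
```

Deriving the payload keeps the two metadata sources from drifting apart, but empty fallbacks (for example a blank title or no creators) should be filled in by hand before sending the request.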

Step 4 - publish (deliberate, final step)

Publishing finalises the record and mints the DOI. On the sandbox this is low-risk, since records are test-only and may be wiped, but treat it as a deliberate action: on the real Zenodo, a published record cannot simply be deleted.

Open the draft link from Step 3 first and check everything looks right. Then, if you want the DOI, run the cell below. Leave it un-run to stop at a reviewed draft.

if HAVE_TOKEN:
    # set confirm to True only when you have reviewed the draft and want to publish
    confirm = True
    if confirm:
        r = requests.post(f"{BASE}/deposit/depositions/{deposition_id}/actions/publish",
                          headers=headers)
        r.raise_for_status()
        published = r.json()
        print("Published! Your test DOI:", published["doi"])
        print("Record:", published["links"]["record_html"])
    else:
        print("Not published - set confirm = True to publish once you have reviewed the draft.")
Published! Your test DOI: 10.5072/zenodo.508226
Record: https://sandbox.zenodo.org/record/508226

That’s the full pipeline

Across both sessions you took a raw, messy file and turned it into a validated, cleaned, documented, FAIR data package with a DMP and an optional DOI - using the AI assistant at every step, while always reading and verifying what it produced.

Notebook | What you built
1 | Oriented in the toolbelt; first look at the raw data
2 | A pandera schema (data contract) that found every rule violation
3 | A quality report and cleaned data; validated the fixes
4 | A documented frictionless Data Package
5 | A DMP draft and an (optional) Zenodo deposit

That is the everyday work of a data steward, made faster and more reliable with the right Python tools and a well-directed assistant.