
Optional: Programming DMP Generation

Affiliations: Leipzig University, Helmholtz Center for Environmental Research - UFZ

In this notebook I will use the Blablador API to turn a fictional project description and a skeleton for a Data Management Plan (DMP) into a project-specific DMP. If you want to rerun this notebook, you need a Blablador API key and need to store it as BLABLADOR_API_KEY in your environment variables. Also make sure to execute this notebook in an environment where the openai Python library is installed, e.g. via pip install openai.
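
Before sending any requests, you can check that the key is actually visible to Python; a minimal sketch:

import os
# fails early with a readable message if the key is missing
assert os.environ.get("BLABLADOR_API_KEY") is not None, "Please set BLABLADOR_API_KEY first"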

import openai
from IPython.display import display, Markdown

We define a helper function that sends a prompt to Blablador and retrieves the result. (source)

def prompt(message:str, model="1 - Llama3 405 on WestAI with 4b quantization"):
    """A prompt helper function that sends a message to Blablador (FZ Jülich)
    and returns only the text response.
    """
    import os
    import openai
    
    # set up the connection to Blablador; passing base_url and api_key to the
    # constructor avoids relying on an OPENAI_API_KEY environment variable
    client = openai.OpenAI(
        base_url="https://helmholtz-blablador.fz-juelich.de:8000/v1",
        api_key=os.environ.get('BLABLADOR_API_KEY')
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}]
    )
    
    # extract answer
    return response.choices[0].message.content
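
Blablador exposes an OpenAI-compatible API, so the model names that can be passed to the helper above can also be queried programmatically. A minimal sketch, assuming the same base URL and API key as in the helper:

import os
import openai

client = openai.OpenAI(
    base_url="https://helmholtz-blablador.fz-juelich.de:8000/v1",
    api_key=os.environ.get('BLABLADOR_API_KEY')
)

# list the models the server currently offers
for m in client.models.list():
    print(m.id)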

Asking the LLM about DMPs

result = prompt("""
Give me a short list of typical sections of a Data Management Plan. 
Write bullet points in markdown format and no detailed explanation.
""")

display(Markdown(result))
result = prompt("""
What is commonly described in a section about "Backup and Archiving" in a 
Data Management Plan? Answer in 3 sentences.
""")

display(Markdown(result))

Our project description

In the following cell you find a description of a fictional project. It contains all aspects of such a project that came to my mind when I thought about the points the LLM listed above. It is structured chronologically, listing things that happen early in the project first and then transitioning towards publication of the manuscript, code and data.

project_description = """
In our project we investigate the underlying physical principles of gastrulation 
in Tribolium castaneum embryo development. To this end, we use light-sheet microscopes
to acquire 3D timelapse imaging data. We store this data in the NGFF file format. 
After acquisition, two scientists, typically a PhD student and a post-doc or 
group leader, look into the data together and decide if the dataset will be analyzed 
in detail. If so, we upload the data to an OMERO server, a research data 
management solution specifically developed for microscopy imaging data. Data on 
this server is automatically backed up by the compute center of our university. We then log in 
to the Jupyter Lab server of the institute, where we analyze the data. Analysis results
are also stored on the OMERO server next to the imaging data they belong to. The
Python analysis code we write is stored on the institutional git server. This 
server is also backed up by the compute center. When the project advances, we start writing
a manuscript using Overleaf, an online service for collaborative manuscript editing 
based on LaTeX files. After every writing session, we save the changed manuscript back 
to the institutional git server. As soon as the manuscript is finished and 
submitted to bioRxiv, a preprint server for the life sciences, we also publish the 
project-related code by marking the project on the git server as public. We also
tag the code with a release version. At the same time we publish the imaging data 
by submitting a copy of the dataset from the OMERO server to zenodo.org, a 
community-driven repository for research data funded by the European Union. Another 
copy of the data, the code and the manuscript is stored on the institutional archive 
server. This server, maintained by the compute center, guarantees to archive data for 
15 years. Documents and data we publish are licensed under the CC-BY 4.0 license. The code 
we publish is licensed under BSD-3. The entire project and all steps of the data life-cycle 
are documented in an institutional lab notebook, where every user has to pay 10 Euro 
per month. Four people will work on the project. The compute center estimates the 
costs for storage and maintenance of the infrastructure at 20k Euro and half a 
position of an IT specialist. The project duration is four years.
"""

We can then use this project description as part of a prompt to the LLM to turn the unstructured text into a DMP.

result = prompt(f"""
You are a professional grant proposal writer. The following is a description of 
a typical project in our "Tribolium Development" Research Group at the University. 
Your task is to reformulate this project description into a Data Management Plan.

{project_description}
""")

display(Markdown(result))

Combining information and structure

We next modify the prompt to also provide the structure we need. This structure may differ from funding agency to funding agency, and thus this step is crucial for customizing the DMP according to the given formal requirements.

result = prompt(f"""
You are a professional grant proposal writer. The following is a description of 
a typical project in our "Tribolium Development" Research Group at the University. 
Your task is to reformulate this project description into a Data Management Plan.

{project_description}

The required structure for the data management plan we need to write is as follows:

# Data Management Plan
## Data description
## Documentation and data quality
## Storage and technical archiving of the project
## Legal obligations and conditions 
## Data exchange and long-term data accessibility
## Responsibilities and resources

Use Markdown for headlines and text style.
""")

display(Markdown(result))