Optional: Programming DMP Generation
In this notebook I will use the Blablador API to turn a fictional project description and a skeleton for a Data Management Plan (DMP) into a project-specific DMP. If you want to rerun this notebook, you need a Blablador API key stored as BLABLADOR_API_KEY in your environment variables. Also make sure to execute this notebook in an environment where the openai Python library is installed, e.g. using pip install openai.
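Before we start, we can check that the API key is actually available. This is just a convenience check; BLABLADOR_API_KEY is the variable name assumed throughout this notebook.
import os
# fail early with a readable message if the key is missing
assert "BLABLADOR_API_KEY" in os.environ, \
    "Please store your Blablador API key as BLABLADOR_API_KEY in your environment variables."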
import openai
from IPython.display import display, Markdown
We define a helper function that sends a prompt to Blablador and retrieves the result. (source)
def prompt(message: str, model="1 - Llama3 405 on WestAI with 4b quantization"):
    """A prompt helper function that sends a message to Blablador (FZ Jülich)
    and returns only the text response.
    """
    import os
    import openai

    # set up the connection to the LLM; base_url and api_key must be passed
    # when creating the client, otherwise it falls back to OPENAI_API_KEY
    client = openai.OpenAI(
        base_url="https://helmholtz-blablador.fz-juelich.de:8000/v1",
        api_key=os.environ.get('BLABLADOR_API_KEY'),
    )

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}]
    )

    # extract the answer text from the first choice
    return response.choices[0].message.content
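In case you want to see which models Blablador currently serves, the endpoint also exposes a model listing. This is a minimal sketch assuming the server implements the standard /v1/models route, which OpenAI-compatible servers typically do.
import os
import openai

client = openai.OpenAI(
    base_url="https://helmholtz-blablador.fz-juelich.de:8000/v1",
    api_key=os.environ.get('BLABLADOR_API_KEY'),
)

# print the identifier of every model the server currently offers
for m in client.models.list():
    print(m.id)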
Asking the LLM about DMPs
result = prompt("""
Give me a short list of typical sections of a Data Management Plan.
Write bullet points in markdown format and no detailed explanation.
""")
display(Markdown(result))
result = prompt("""
What is commonly described in a section about "Backup and Archiving" in a
Data Management Plan? Answer in 3 sentences.
""")
display(Markdown(result))
Our project description
In the following cell you find a description of a fictional project. It contains all aspects of such a project that came to my mind when I thought about the aspects the LLM mentioned above. It is structured chronologically, listing things that happen early in the project first, and transitioning towards the publication of a manuscript, code and data.
project_description = """
In our project we investigate the underlying physical principles for Gastrulation
in Tribolium castaneum embryo development. Therefore, we use light-sheet microscopes
to acquire 3D timelapse imaging data. We store this data in the NGFF file format.
After acquisition, two scientists, typically a PhD student and a post-doc or
group leader, look into the data together and decide if the dataset will be analyzed
in detail. If yes, we upload the data to an Omero-Server, a research data
management solution specifically developed for microscopy imaging data. Data on
this server is automatically backed up by the compute center of our university. We then log in
to the Jupyter Lab server of the institute where we analyze the data. Analysis results
are also stored in the Omero-Server next to the imaging data they belong to. The
Python analysis code we write is stored in the institutional git-server. This
server is also backed up by the compute center. When the project advances, we start writing
a manuscript using Overleaf, an online service for collaborative manuscript editing
based on LaTeX files. After every writing session, we save back the changed manuscript
to the institutional git server. As soon as the manuscript is finished and
submitted to bioRxiv, a preprint server in the life sciences, we also publish the
project-related code by marking the project on the git-server as public. We also
tag the code with a release version. At the same time we publish the imaging data
by submitting a copy of the dataset from the Omero-Server to zenodo.org, a
community-driven repository for research data funded by the European Union. Another
copy of the data, the code and the manuscript is stored on the institutional archive
server. This server, maintained by the compute center, guarantees to archive data for
15 years. Documents and data we publish are licensed under the CC-BY 4.0 license. The code
we publish is licensed under BSD-3. The entire project and all steps of the data life-cycle
are documented in an institutional lab notebook where every user has to pay 10 Euro
per month. Four people will work on the project. The compute center estimates the
costs for storage and maintenance of the infrastructure at 20k Euro and half a
position of an IT specialist. The project duration is four years.
"""
We can then use this project description as part of a prompt to the LLM to turn this unstructured text into a DMP.
result = prompt(f"""
You are a professional grant proposal writer. In the following comes a description of
a common project in our "Tribolium Development" Research Group at the University.
Your task is to reformulate this project description into a Data Management Plan.
{project_description}
""")
display(Markdown(result))
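As the generated plan is plain Markdown, we can also save it to a file for further manual editing. The filename below is arbitrary.
# store the generated draft next to the notebook
with open("data_management_plan_draft.md", "w") as file:
    file.write(result)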
Combining information and structure
We next modify the prompt to also add information about the structure we need. This structure may differ from funding agency to funding agency, and thus this step is crucial for customizing the DMP according to the given formal requirements.
result = prompt(f"""
You are a professional grant proposal writer. In the following comes a description of
a common project in our "Tribolium Development" Research Group at the University.
Your task is to reformulate this project description into a Data Management Plan.
{project_description}
The data management plan we need to write requires the following structure:
# Data Management Plan
## Data description
## Documentation and data quality
## Storage and technical archiving during the project
## Legal obligations and conditions
## Data exchange and long-term data accessibility
## Responsibilities and resources
Use Markdown for headlines and text style.
""")
display(Markdown(result))
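Since the required structure is just plain text inside the prompt, adapting the DMP to another funding agency only means swapping the list of headlines. As a sketch, the alternative headings below are made up for illustration; substitute the sections your agency actually requires.
# hypothetical alternative template; replace with your funder's real headings
alternative_structure = """
# Data Management Plan
## Existing data and reuse
## Data sharing and access
## Costs and resources
"""

result = prompt(f"""
You are a professional grant proposal writer. In the following comes a description of
a common project in our "Tribolium Development" Research Group at the University.
Your task is to reformulate this project description into a Data Management Plan.
{project_description}
The data management plan we need to write requires the following structure:
{alternative_structure}
Use Markdown for headlines and text style.
""")
display(Markdown(result))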