{ "cells": [ { "cell_type": "markdown", "id": "d826b3c4-70a7-45f4-95b3-409c078058c6", "metadata": {}, "source": [ "# Uploading training data to the Hugging Face Hub\n", "It is good scientific practice to share training data, e.g. for fine-tuning language models, so that others can reproduce the process. In this notebook we demonstrate how to load a dataset in a custom format, convert it to a Hugging Face-compatible format and upload it to the Hub. Note: You need to configure an `HF_TOKEN` environment variable with a token that has write access. If you do not have such a token yet, you can create one [here](https://huggingface.co/settings/tokens)." ] }, { "cell_type": "code", "execution_count": 1, "id": "4d43db94-c1e8-4ad4-a9aa-210db2dae183", "metadata": {}, "outputs": [], "source": [ "from datasets import Dataset\n", "import pandas as pd\n", "import json" ] }, { "cell_type": "markdown", "id": "6183cf09-4283-470b-aac7-70c9e0891769", "metadata": {}, "source": [ "First, we load our data in JSONL format. This dataset was derived from the [Bio-image Analysis Notebooks](https://haesleinhuepf.github.io/BioImageAnalysisNotebooks/intro.html), which are licensed [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en) by Robert Haase, Guillaume Witz, Miguel Fernandes, Marcelo Leomil Zoccoler, Shannon Taylor, Mara Lampert and Till Korten. The notebook collection was converted into the questions and answers given here using OpenAI's GPT-3.5." ] }, { "cell_type": "code", "execution_count": 2, "id": "e252d881-ed03-44f1-9dd4-c8df052baba3", "metadata": {}, "outputs": [], "source": [ "qa_jsonl_filename = \"questions_answers.jsonl\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "498b9d02-6344-40a7-9914-c6c618cff738", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | question | answer |
|---|---|---|
| 0 | How can we calculate the average values along ... | \\nThis code imports the numpy library and crea... |
| 1 | How can I write Python code to apply statistic... | \\nThe code uses the numpy library in Python, w... |
| 2 | How can we obtain the precise shape (dimension... | \\nThis code reads an image file called "blobs.... |
| 3 | How can we use indices in Python to crop image... | \\nThis code imports the necessary functions fr... |
| 4 | How can we write Python code to crop an image ... | \\nThe code imports functions `imshow` and `imr... |

| | question | answer |
|---|---|---|
| 0 | How can we calculate the average values along ... | \\nThis code imports the numpy library and crea... |
| 1 | How can I write Python code to apply statistic... | \\nThe code uses the numpy library in Python, w... |
| 2 | How can we obtain the precise shape (dimension... | \\nThis code reads an image file called "blobs.... |
| 3 | How can we use indices in Python to crop image... | \\nThis code imports the necessary functions fr... |
| 4 | How can we write Python code to crop an image ... | \\nThe code imports functions `imshow` and `imr... |
| ... | ... | ... |
| 125 | How can we use Python code to visualize our `l... | \\nThe code uses the `curtain` function from th... |
| 126 | How can we open an image and label objects in ... | \\nThis code imports the necessary libraries an... |
| 127 | How can we use Python to analyze the labeled e... | \\nThe code uses the skimage library's measure ... |
| 128 | What Python code can be used to create a label... | \\nThis code imports necessary libraries and fu... |
| 129 | Can you provide a Python code for creating nea... | \\nThis code uses the pyclesperanto_prototype l... |

130 rows × 2 columns
\n", "