{ "cells": [ { "cell_type": "markdown", "id": "d826b3c4-70a7-45f4-95b3-409c078058c6", "metadata": {}, "source": [ "# Uploading training data to the Huggingface Hub\n", "It is good scientific practice to share training data, e.g. for fine-tuning language models, so that others can reproduce the process. In this notebook we demonstrate how we load a dataset in a custom format, convert it to a Huggingface compatible format and upload it to the hub. Note: You need to configure a `HF_TOKEN` environment variable with write access. If you do not have such a token yet, you can get it [here](https://huggingface.co/settings/tokens)." ] }, { "cell_type": "code", "execution_count": 1, "id": "4d43db94-c1e8-4ad4-a9aa-210db2dae183", "metadata": {}, "outputs": [], "source": [ "from datasets import Dataset\n", "import pandas as pd\n", "import json" ] }, { "cell_type": "markdown", "id": "6183cf09-4283-470b-aac7-70c9e0891769", "metadata": {}, "source": [ "First, we load our data in jsonl format. This dataset was derived from the [Bio-image Analysis Notebooks](https://haesleinhuepf.github.io/BioImageAnalysisNotebooks/intro.html) which are licensed [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en) by Robert Haase, Guillaume Witz, Miguel Fernandes, Marcelo Leomil Zoccoler, Shannon Taylor, Mara Lampert and Till Korten. The collection was processed to the questions and answers given here using OpenAI's GPT 3.5." ] }, { "cell_type": "code", "execution_count": 2, "id": "e252d881-ed03-44f1-9dd4-c8df052baba3", "metadata": {}, "outputs": [], "source": [ "qa_jsonl_filename = \"questions_answers.jsonl\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "498b9d02-6344-40a7-9914-c6c618cff738", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
questionanswer
0How can we calculate the average values along ...\\nThis code imports the numpy library and crea...
1How can I write Python code to apply statistic...\\nThe code uses the numpy library in Python, w...
2How can we obtain the precise shape (dimension...\\nThis code reads an image file called \"blobs....
3How can we use indices in Python to crop image...\\nThis code imports the necessary functions fr...
4How can we write Python code to crop an image ...\\nThe code imports functions `imshow` and `imr...
\n", "
" ], "text/plain": [ " question \\\n", "0 How can we calculate the average values along ... \n", "1 How can I write Python code to apply statistic... \n", "2 How can we obtain the precise shape (dimension... \n", "3 How can we use indices in Python to crop image... \n", "4 How can we write Python code to crop an image ... \n", "\n", " answer \n", "0 \\nThis code imports the numpy library and crea... \n", "1 \\nThe code uses the numpy library in Python, w... \n", "2 \\nThis code reads an image file called \"blobs.... \n", "3 \\nThis code imports the necessary functions fr... \n", "4 \\nThe code imports functions `imshow` and `imr... " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = []\n", "with open(qa_jsonl_filename, 'r') as file:\n", " for line in file:\n", " json_object = json.loads(line.strip())\n", " data.append(json_object)\n", "# Convert the array into a pandas DataFrame\n", "df = pd.DataFrame(data)\n", "df.head()" ] }, { "cell_type": "markdown", "id": "c68038cf-6a5f-4198-a42f-a94cf438d06f", "metadata": {}, "source": [ "We convert this to a Huggingface dataset using the [datasets library](https://github.com/huggingface/datasets)." ] }, { "cell_type": "code", "execution_count": 4, "id": "48584b16-4db6-4e45-99e5-75097141ba06", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dataset({\n", " features: ['question', 'answer'],\n", " num_rows: 130\n", "})" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a Hugging Face dataset from the DataFrame\n", "dataset = Dataset.from_pandas(df)\n", "dataset" ] }, { "cell_type": "markdown", "id": "1058a1a9-e975-40fa-b88e-f5ae2081f86a", "metadata": {}, "source": [ "Next, we can upload this dataset." ] }, { "cell_type": "code", "execution_count": 5, "id": "51a28a15-2a1a-41e8-a8b4-362a11034e5e", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2cf4bc38bd22400ab7932e66d80da43f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Uploading the dataset shards: 0%| | 0/1 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
questionanswer
0How can we calculate the average values along ...\\nThis code imports the numpy library and crea...
1How can I write Python code to apply statistic...\\nThe code uses the numpy library in Python, w...
2How can we obtain the precise shape (dimension...\\nThis code reads an image file called \"blobs....
3How can we use indices in Python to crop image...\\nThis code imports the necessary functions fr...
4How can we write Python code to crop an image ...\\nThe code imports functions `imshow` and `imr...
.........
125How can we use Python code to visualize our `l...\\nThe code uses the `curtain` function from th...
126How can we open an image and label objects in ...\\nThis code imports the necessary libraries an...
127How can we use Python to analyze the labeled e...\\nThe code uses the skimage library's measure ...
128What Python code can be used to create a label...\\nThis code imports necessary libraries and fu...
129Can you provide a Python code for creating nea...\\nThis code uses the pyclesperanto_prototype l...
\n", "

130 rows × 2 columns

\n", "" ], "text/plain": [ " question \\\n", "0 How can we calculate the average values along ... \n", "1 How can I write Python code to apply statistic... \n", "2 How can we obtain the precise shape (dimension... \n", "3 How can we use indices in Python to crop image... \n", "4 How can we write Python code to crop an image ... \n", ".. ... \n", "125 How can we use Python code to visualize our `l... \n", "126 How can we open an image and label objects in ... \n", "127 How can we use Python to analyze the labeled e... \n", "128 What Python code can be used to create a label... \n", "129 Can you provide a Python code for creating nea... \n", "\n", " answer \n", "0 \\nThis code imports the numpy library and crea... \n", "1 \\nThe code uses the numpy library in Python, w... \n", "2 \\nThis code reads an image file called \"blobs.... \n", "3 \\nThis code imports the necessary functions fr... \n", "4 \\nThe code imports functions `imshow` and `imr... \n", ".. ... \n", "125 \\nThe code uses the `curtain` function from th... \n", "126 \\nThis code imports the necessary libraries an... \n", "127 \\nThe code uses the skimage library's measure ... \n", "128 \\nThis code imports necessary libraries and fu... \n", "129 \\nThis code uses the pyclesperanto_prototype l... \n", "\n", "[130 rows x 2 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset2.to_pandas()" ] }, { "cell_type": "code", "execution_count": null, "id": "24ff7579-c67d-4242-8b72-4f41a71c9b4b", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }