Uploading training data to the Hugging Face Hub

Sharing training data, e.g. for fine-tuning language models, is good scientific practice because it lets others reproduce the process. In this notebook we load a dataset stored in a custom format, convert it to a Hugging Face-compatible format and upload it to the Hub. Note: You need to set an HF_TOKEN environment variable containing a token with write access. If you do not have such a token yet, you can create one in the access token settings of your Hugging Face account.
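A quick check that the token is visible to the notebook can save a failed upload later. The following is a minimal sketch: the helper name `require_hf_token` is hypothetical and not part of any library, and it only reads the HF_TOKEN environment variable mentioned above.

```python
import os


def require_hf_token(env=os.environ):
    """Return the token stored in HF_TOKEN or raise a helpful error.

    `require_hf_token` is a hypothetical helper for illustration only.
    `env` is any mapping; it defaults to the process environment.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Create a token with write access "
            "and export it before starting the notebook."
        )
    return token
```

Calling `require_hf_token()` at the top of the notebook would fail early if the variable is missing, rather than at upload time.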

from datasets import Dataset
import pandas as pd
import json

First, we load our data in JSONL (JSON Lines) format, where each line holds one JSON object. This dataset was derived from the Bio-image Analysis Notebooks, which are licensed CC-BY 4.0 by Robert Haase, Guillaume Witz, Miguel Fernandes, Marcelo Leomil Zoccoler, Shannon Taylor, Mara Lampert and Till Korten. The notebook collection was converted into the questions and answers used here with OpenAI’s GPT 3.5.

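qa_jsonl_filename = "questions_answers.jsonl"
# Read the JSON Lines file: one JSON object per line
data = []
with open(qa_jsonl_filename, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # skip blank lines, which json.loads would reject
        data.append(json.loads(line))
# Convert the list of dictionaries into a pandas DataFrame
df = pd.DataFrame(data)
df.head()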
   question                                           answer
0  How can we calculate the average values along ...  \nThis code imports the numpy library and crea...
1  How can I write Python code to apply statistic...  \nThe code uses the numpy library in Python, w...
2  How can we obtain the precise shape (dimension...  \nThis code reads an image file called "blobs....
3  How can we use indices in Python to crop image...  \nThis code imports the necessary functions fr...
4  How can we write Python code to crop an image ...  \nThe code imports functions `imshow` and `imr...
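Before converting a custom file, it can be worth checking that every record actually has the two expected fields. The following is a minimal sketch using only the standard library; `read_qa_jsonl` is a hypothetical helper and the sample lines are made up for illustration, not taken from the real dataset.

```python
import io
import json


def read_qa_jsonl(lines):
    """Parse JSON Lines and keep only well-formed question/answer records.

    `lines` is any iterable of strings, e.g. an open text file.
    Returns (records, skipped_line_numbers).
    """
    records, skipped = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            skipped.append(number)  # line is not valid JSON
            continue
        if isinstance(obj, dict) and {"question", "answer"}.issubset(obj):
            records.append(obj)
        else:
            skipped.append(number)  # a required key is missing
    return records, skipped


# Made-up sample lines for illustration only.
sample = io.StringIO(
    '{"question": "Q1?", "answer": "A1"}\n'
    "\n"
    "not json\n"
    '{"question": "Q2?"}\n'
)
records, skipped = read_qa_jsonl(sample)
print(len(records), skipped)
```

The returned `records` list can be passed to `pd.DataFrame` in the same way as `data`, and `skipped` points to the lines that need a closer look.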

We convert this DataFrame to a Hugging Face dataset using the datasets library.

# Create a Hugging Face dataset from the DataFrame
dataset = Dataset.from_pandas(df)
dataset
Dataset({
    features: ['question', 'answer'],
    num_rows: 130
})

Next, we can upload this dataset.

dataset.push_to_hub("haesleinhuepf/bio-image-analysis-qa")
CommitInfo(commit_url='https://huggingface.co/datasets/haesleinhuepf/bio-image-analysis-qa/commit/628f86b20659a224dc569b20215686597f58bb83', commit_message='Upload dataset', commit_description='', oid='628f86b20659a224dc569b20215686597f58bb83', pr_url=None, pr_revision=None, pr_num=None)

Note: It is recommended to document details, in particular the data sources, on the Hugging Face Hub. You can do this in the web interface on the page of your dataset, which you can reach by following the commit URL in the output above.
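The same details can also be kept in the repository's README.md, which the Hub renders as the dataset card. The sketch below only assembles the card text; `build_dataset_card` is a hypothetical helper, and the `license` value is an assumption that mirrors the CC-BY 4.0 license of the source notebooks rather than a statement about how this dataset is actually licensed. The commented-out upload assumes the huggingface_hub package is installed and a write token is configured.

```python
def build_dataset_card(title, license_id, source_note):
    """Assemble README.md text: YAML metadata block, then Markdown body.

    `build_dataset_card` is a hypothetical helper, not a library function.
    """
    return (
        "---\n"
        f"license: {license_id}\n"
        "---\n\n"
        f"# {title}\n\n"
        f"{source_note}\n"
    )


card_text = build_dataset_card(
    title="bio-image-analysis-qa",
    license_id="cc-by-4.0",  # assumption: same license as the source notebooks
    source_note=(
        "Questions and answers derived from the Bio-image Analysis "
        "Notebooks (CC-BY 4.0) using OpenAI's GPT 3.5."
    ),
)
print(card_text)

# Uploading the card could then look like this (not executed here):
# from huggingface_hub import HfApi
# HfApi().upload_file(
#     path_or_fileobj=card_text.encode("utf-8"),
#     path_in_repo="README.md",
#     repo_id="haesleinhuepf/bio-image-analysis-qa",
#     repo_type="dataset",
# )
```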

Downloading the data

From now on, you and others can download this data with load_dataset:

from datasets import load_dataset
dataset2_name = "haesleinhuepf/bio-image-analysis-qa"
dataset2 = load_dataset(dataset2_name, split="all")
dataset2
Dataset({
    features: ['question', 'answer'],
    num_rows: 130
})
dataset2.to_pandas()
     question                                           answer
0    How can we calculate the average values along ...  \nThis code imports the numpy library and crea...
1    How can I write Python code to apply statistic...  \nThe code uses the numpy library in Python, w...
2    How can we obtain the precise shape (dimension...  \nThis code reads an image file called "blobs....
3    How can we use indices in Python to crop image...  \nThis code imports the necessary functions fr...
4    How can we write Python code to crop an image ...  \nThe code imports functions `imshow` and `imr...
...  ...                                                ...
125  How can we use Python code to visualize our `l...  \nThe code uses the `curtain` function from th...
126  How can we open an image and label objects in ...  \nThis code imports the necessary libraries an...
127  How can we use Python to analyze the labeled e...  \nThe code uses the skimage library's measure ...
128  What Python code can be used to create a label...  \nThis code imports necessary libraries and fu...
129  Can you provide a Python code for creating nea...  \nThis code uses the pyclesperanto_prototype l...

130 rows × 2 columns
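Because reproducibility is the motivation here, it may help to pin the download to the exact commit created by the upload. The sketch below assumes the commit hash from the CommitInfo output above is still part of the repository history; `load_pinned` is a hypothetical helper name, while `revision` is a parameter of `load_dataset`.

```python
REPO_ID = "haesleinhuepf/bio-image-analysis-qa"
# Commit hash taken from the CommitInfo printed after the upload.
REVISION = "628f86b20659a224dc569b20215686597f58bb83"


def load_pinned(repo_id=REPO_ID, revision=REVISION):
    """Download the dataset exactly as it was at the given commit.

    `load_pinned` is a hypothetical helper for illustration only.
    """
    # Imported lazily so that defining the function needs no network access.
    from datasets import load_dataset

    return load_dataset(repo_id, split="all", revision=revision)
```

Calling `load_pinned()` would then return the same 130 rows even if the repository receives further commits later.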