{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d826b3c4-70a7-45f4-95b3-409c078058c6",
   "metadata": {},
   "source": [
    "# Uploading training data to the Huggingface Hub\n",
    "It is good scientific practice to share training data, e.g. for fine-tuning language models, so that others can reproduce the process. In this notebook we demonstrate how we load a dataset in a custom format, convert it to a Huggingface compatible format and upload it to the hub. Note: You need to configure a `HF_TOKEN` environment variable with write access. If you do not have such a token yet, you can get it [here](https://huggingface.co/settings/tokens)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4d43db94-c1e8-4ad4-a9aa-210db2dae183",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import Dataset\n",
    "import pandas as pd\n",
    "import json"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6183cf09-4283-470b-aac7-70c9e0891769",
   "metadata": {},
   "source": [
    "First, we load our data in jsonl format. This dataset was derived from the [Bio-image Analysis Notebooks](https://haesleinhuepf.github.io/BioImageAnalysisNotebooks/intro.html) which are licensed [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en) by Robert Haase, Guillaume Witz, Miguel Fernandes, Marcelo Leomil Zoccoler, Shannon Taylor, Mara Lampert and Till Korten. The collection was processed to the questions and answers given here using OpenAI's GPT 3.5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e252d881-ed03-44f1-9dd4-c8df052baba3",
   "metadata": {},
   "outputs": [],
   "source": [
    "qa_jsonl_filename = \"questions_answers.jsonl\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "498b9d02-6344-40a7-9914-c6c618cff738",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>answer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>How can we calculate the average values along ...</td>\n",
       "      <td>\\nThis code imports the numpy library and crea...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>How can I write Python code to apply statistic...</td>\n",
       "      <td>\\nThe code uses the numpy library in Python, w...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>How can we obtain the precise shape (dimension...</td>\n",
       "      <td>\\nThis code reads an image file called \"blobs....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>How can we use indices in Python to crop image...</td>\n",
       "      <td>\\nThis code imports the necessary functions fr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>How can we write Python code to crop an image ...</td>\n",
       "      <td>\\nThe code imports functions `imshow` and `imr...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            question  \\\n",
       "0  How can we calculate the average values along ...   \n",
       "1  How can I write Python code to apply statistic...   \n",
       "2  How can we obtain the precise shape (dimension...   \n",
       "3  How can we use indices in Python to crop image...   \n",
       "4  How can we write Python code to crop an image ...   \n",
       "\n",
       "                                              answer  \n",
       "0  \\nThis code imports the numpy library and crea...  \n",
       "1  \\nThe code uses the numpy library in Python, w...  \n",
       "2  \\nThis code reads an image file called \"blobs....  \n",
       "3  \\nThis code imports the necessary functions fr...  \n",
       "4  \\nThe code imports functions `imshow` and `imr...  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = []\n",
    "with open(qa_jsonl_filename, 'r') as file:\n",
    "    for line in file:\n",
    "        json_object = json.loads(line.strip())\n",
    "        data.append(json_object)\n",
    "# Convert the array into a pandas DataFrame\n",
    "df = pd.DataFrame(data)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c68038cf-6a5f-4198-a42f-a94cf438d06f",
   "metadata": {},
   "source": [
    "We convert this to a Huggingface dataset using the [datasets library](https://github.com/huggingface/datasets)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "48584b16-4db6-4e45-99e5-75097141ba06",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['question', 'answer'],\n",
       "    num_rows: 130\n",
       "})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a Hugging Face dataset from the DataFrame\n",
    "dataset = Dataset.from_pandas(df)\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1058a1a9-e975-40fa-b88e-f5ae2081f86a",
   "metadata": {},
   "source": [
    "Next, we can upload this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "51a28a15-2a1a-41e8-a8b4-362a11034e5e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2cf4bc38bd22400ab7932e66d80da43f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a766658cc0b042aeba1c3cbe3e840056",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "CommitInfo(commit_url='https://huggingface.co/datasets/haesleinhuepf/bio-image-analysis-qa/commit/628f86b20659a224dc569b20215686597f58bb83', commit_message='Upload dataset', commit_description='', oid='628f86b20659a224dc569b20215686597f58bb83', pr_url=None, pr_revision=None, pr_num=None)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.push_to_hub(\"haesleinhuepf/bio-image-analysis-qa\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02c01a82-e4aa-43c7-9fbf-40ccc0b9b524",
   "metadata": {},
   "source": [
    "Note: It is recommended to specify details, in particular data sources, on the Huggingface hub. You can do this in the graphical user interface of the website of your model. You can click the link in the message above to go to the model page."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c627d1af-4c14-48f1-822a-1708456b28cf",
   "metadata": {},
   "source": [
    "## Downloading the data\n",
    "In the future, you and others can download this data easily like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "68406b44-19d5-47bf-980e-a7d2909e4e37",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fa8356ea-804f-4d9e-9730-e10ffc255ac9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['question', 'answer'],\n",
       "    num_rows: 130\n",
       "})"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset2_name = \"haesleinhuepf/bio-image-analysis-qa\"\n",
    "dataset2 = load_dataset(dataset2_name, split=\"all\")\n",
    "dataset2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "6bc10573-2d12-4842-97ea-b497e3784374",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>answer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>How can we calculate the average values along ...</td>\n",
       "      <td>\\nThis code imports the numpy library and crea...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>How can I write Python code to apply statistic...</td>\n",
       "      <td>\\nThe code uses the numpy library in Python, w...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>How can we obtain the precise shape (dimension...</td>\n",
       "      <td>\\nThis code reads an image file called \"blobs....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>How can we use indices in Python to crop image...</td>\n",
       "      <td>\\nThis code imports the necessary functions fr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>How can we write Python code to crop an image ...</td>\n",
       "      <td>\\nThe code imports functions `imshow` and `imr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>125</th>\n",
       "      <td>How can we use Python code to visualize our `l...</td>\n",
       "      <td>\\nThe code uses the `curtain` function from th...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>126</th>\n",
       "      <td>How can we open an image and label objects in ...</td>\n",
       "      <td>\\nThis code imports the necessary libraries an...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>127</th>\n",
       "      <td>How can we use Python to analyze the labeled e...</td>\n",
       "      <td>\\nThe code uses the skimage library's measure ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>128</th>\n",
       "      <td>What Python code can be used to create a label...</td>\n",
       "      <td>\\nThis code imports necessary libraries and fu...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129</th>\n",
       "      <td>Can you provide a Python code for creating nea...</td>\n",
       "      <td>\\nThis code uses the pyclesperanto_prototype l...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>130 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                              question  \\\n",
       "0    How can we calculate the average values along ...   \n",
       "1    How can I write Python code to apply statistic...   \n",
       "2    How can we obtain the precise shape (dimension...   \n",
       "3    How can we use indices in Python to crop image...   \n",
       "4    How can we write Python code to crop an image ...   \n",
       "..                                                 ...   \n",
       "125  How can we use Python code to visualize our `l...   \n",
       "126  How can we open an image and label objects in ...   \n",
       "127  How can we use Python to analyze the labeled e...   \n",
       "128  What Python code can be used to create a label...   \n",
       "129  Can you provide a Python code for creating nea...   \n",
       "\n",
       "                                                answer  \n",
       "0    \\nThis code imports the numpy library and crea...  \n",
       "1    \\nThe code uses the numpy library in Python, w...  \n",
       "2    \\nThis code reads an image file called \"blobs....  \n",
       "3    \\nThis code imports the necessary functions fr...  \n",
       "4    \\nThe code imports functions `imshow` and `imr...  \n",
       "..                                                 ...  \n",
       "125  \\nThe code uses the `curtain` function from th...  \n",
       "126  \\nThis code imports the necessary libraries an...  \n",
       "127  \\nThe code uses the skimage library's measure ...  \n",
       "128  \\nThis code imports necessary libraries and fu...  \n",
       "129  \\nThis code uses the pyclesperanto_prototype l...  \n",
       "\n",
       "[130 rows x 2 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset2.to_pandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24ff7579-c67d-4242-8b72-4f41a71c9b4b",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}