Fine-tuning an OpenAI GPT from questions and answers#

In this notebook we take a text file in the following format to fine-tune a GPT-based language model using OpenAI’s infrastructure.

Question:

How can I open CZI or LIF files using Python?

Answer:

To open CZI or LIF files, you can use the AICSImageIO package. 
In the following code the file `filename` will be loaded and 
the image data will be stored in `image`.

```python
from aicsimageio import AICSImage
aics_image = AICSImage("../../data/EM_C_6_c0.ome.tif")

np_image = aics_image.get_image_data("ZYX")
```

See also:

Todo: We could submit training and validation data separately. This notebook does not cover this yet because our pool of training data is still small. As soon as we have more examples, we can give this a try.
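As preparation for that future step, a train/validation split could be sketched as follows. The function name `split_train_validation` and the 90/10 ratio are assumptions, not part of this notebook yet:

```python
import random

def split_train_validation(examples, validation_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and split off a validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]
```

The fixed seed makes the split reproducible, so reruns of the notebook would train on the same examples.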

from fine_tuning_utilities import load_jsonl_file, save_jsonl_file
import time
import os
import openai
import json

First, we need to convert the text document into the JSONL format required by OpenAI, representing conversations between user and assistant.

qa_text_filename = "question_answers_generated.txt"

with open(qa_text_filename, "r") as file:
    text = file.read()

# Convert text to a list of message dictionaries
training_data = []

blocks = text.split("Question:")
for block in blocks:
    sub_blocks = block.split("Answer:")
    if len(sub_blocks) == 2:
        question = sub_blocks[0].strip()
        answer = sub_blocks[1].strip()
        
        training_data.append(
            {
                "messages": [
                    # {"role": "system", "content": """Enter a smart system message here."""},
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer}
                ]
            })

training_data[:2]
[{'messages': [{'role': 'user',
    'content': 'How can I display an image with a slider and label showing mouse position and intensity?'},
   {'role': 'assistant',
    'content': 'To display an image with a slider and label showing mouse position and intensity, you can use the following code:\n```python\nstackview.annotate(image, labels)\n```'}]},
 {'messages': [{'role': 'user',
    'content': 'How can I allow cropping an image along all axes?'},
   {'role': 'assistant',
    'content': 'You can crop an image along all axes using the following function:\n```python\nstackview.crop(image)\n```'}]}]
# save training data to a temporary file
training_data_file_path = "training_data.jsonl"
save_jsonl_file(training_data, training_data_file_path)
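The helpers `load_jsonl_file` and `save_jsonl_file` are imported from the accompanying `fine_tuning_utilities` module, which is not shown here. Assuming they follow the usual JSONL convention of one JSON object per line, a minimal sketch could look like this:

```python
import json

def save_jsonl_file(dictionaries, filename):
    """Write a list of dictionaries to a file, one JSON object per line."""
    with open(filename, "w") as f:
        for entry in dictionaries:
            f.write(json.dumps(entry) + "\n")

def load_jsonl_file(filename):
    """Read a jsonl file back into a list of dictionaries."""
    with open(filename, "r") as f:
        return [json.loads(line) for line in f if line.strip()]
```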
client = openai.OpenAI()

# upload and preprocess file
training_file = client.files.create(
    file=open(training_data_file_path, "rb"),
    purpose='fine-tune',
)

# wait until preprocessing is finished
while client.files.retrieve(training_file.id).status != "processed":
    time.sleep(30)

print("Uploading / preprocessing done.")
Uploading / preprocessing done.
# start fine-tuning
fine_tuning_job = client.fine_tuning.jobs.create(
                        training_file=training_file.id, 
                        model="gpt-3.5-turbo")
fine_tuning_job
FineTuningJob(id='ftjob-nSXw9q94peyfTaBatGGahuZN', created_at=1717680034, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-0POmhzyaeDng5lZtM7Cls3vt', result_files=[], seed=485481089, status='validating_files', trained_tokens=None, training_file='file-v6hGXPsQ2JYLRzgcmWPZvGke', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)
job_details = client.fine_tuning.jobs.retrieve(
                        fine_tuning_job.id)
job_details
FineTuningJob(id='ftjob-nSXw9q94peyfTaBatGGahuZN', created_at=1717680034, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-0POmhzyaeDng5lZtM7Cls3vt', result_files=[], seed=485481089, status='validating_files', trained_tokens=None, training_file='file-v6hGXPsQ2JYLRzgcmWPZvGke', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)
job_details.status
'validating_files'
job_details = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
job_details.status
'validating_files'
job_details = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
job_details.error
Error(code=None, message=None, param=None)

If you don’t want to rerun the cell above manually until the job finishes, you can also poll the job status in a loop:

while client.fine_tuning.jobs.retrieve(fine_tuning_job.id).status not in ["succeeded", "failed"]:
    time.sleep(120)

job_details = client.fine_tuning.jobs.retrieve(
                fine_tuning_job.id)
job_details.status 
'succeeded'

Retrieving the new model name#

Once done, one can retrieve the name of the fine-tuned model like this:

model_name = job_details.fine_tuned_model
model_name
'ft:gpt-3.5-turbo-0125:leipzig-university::9X7PFVgP'
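This name can then be passed as the `model` parameter to the chat completions API. A small helper sketching this usage is shown below; the function name `ask_fine_tuned_model` is an assumption for illustration, and running it requires a valid API key and the fine-tuned model:

```python
def ask_fine_tuned_model(client, model_name, question):
    """Send a single question to the fine-tuned model and return its answer."""
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```

For example, `ask_fine_tuned_model(client, model_name, "How can I open CZI or LIF files using Python?")` should now answer in the style of our training data.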