GPT-4 Omni VLM#

In this notebook we will use the vision language model GPT-4 Omni (gpt-4o) to inspect an image.

import openai
from skimage.io import imread
import stackview
from image_utilities import numpy_to_bytestream, extract_json
import base64
from stackview._image_widget import _img_to_rgb
import json

Example image#

First we load a single slice of a medical tomography image.

mri = imread("data/Haase_MRT_tfl3d1.tif")[100]
stackview.insight(mri)
shape: (256, 256)
dtype: uint8
size: 64.0 kB
min: 0
max: 255
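The summary shown by `stackview.insight` (shape, dtype, size, min, max) can also be computed directly with numpy; a minimal sketch on a synthetic uint8 image standing in for the MRI slice:

```python
import numpy as np

# synthetic 8-bit image with the same shape as the MRI slice
image = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)

print("shape:", image.shape)              # (256, 256)
print("dtype:", image.dtype)              # uint8
print("size:", image.nbytes / 1024, "kB") # 64.0 kB
print("min:", image.min(), "max:", image.max())
```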

We will now send the image to ChatGPT and ask it what it shows.

def prompt_chatGPT(prompt: str, image, model="gpt-4o"):
    """A prompt helper function that sends a text prompt and an image
    to OpenAI and returns only the text response.
    """
    rgb_image = _img_to_rgb(image)
    byte_stream = numpy_to_bytestream(rgb_image)
    base64_image = base64.b64encode(byte_stream).decode('utf-8')

    message = [{"role": "user", "content": [
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}"
            }
        }
    ]}]

    # setup connection to the LLM
    client = openai.OpenAI()

    # submit prompt
    response = client.chat.completions.create(
        model=model,
        messages=message
    )

    # extract answer
    return response.choices[0].message.content
prompt_chatGPT("what's in this image?", mri, model="gpt-4o")
'This image shows a sagittal MRI scan of a human head. You can see the brain, spinal cord, and other anatomical features such as the nasal passages and the structure of the skull.'
Next, we ask the model to locate an object, in this case a cat, in a photograph and to return its bounding box as JSON.

cat_image = imread("data/real_cat.png")

reply = prompt_chatGPT("""
Give me a JSON object with a bounding box around the cat in this image.
The format should be like this: {'x':int,'y':int,'width':int,'height':int}
""", cat_image)
print(reply)
print(reply)
bb = json.loads(extract_json(reply))

stackview.add_bounding_boxes(cat_image, [bb])
```json
{
  "x": 150,
  "y": 50,
  "width": 140,
  "height": 200
}
```
shape: (512, 512, 3)
dtype: uint8
size: 768.0 kB
min: 0
max: 255

Exercise#

Use a vision-language model to determine the content of an image, e.g. membrane2d.tif. Ask the model to differentiate these cases:

  • An image with bright blob-like structures

  • An image with membrane-like structures such as lines or meshes

Make sure the model responds with the case only and not with a detailed explanation.
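A possible starting point for the exercise, reusing `prompt_chatGPT` from above. The prompt wording is a suggestion you may adapt; the actual call is commented out because it requires an OpenAI API key:

```python
classification_prompt = """
Look at this image and answer with exactly one of these two options:
- blobs: the image shows bright blob-like structures
- membranes: the image shows membrane-like structures such as lines or meshes
Answer with the single word only, no explanation.
"""

# image = imread("data/membrane2d.tif")
# print(prompt_chatGPT(classification_prompt, image))
```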