{
"cells": [
{
"cell_type": "markdown",
"id": "d26ca72c-d147-495c-bc01-1e598b6bb729",
"metadata": {},
"source": [
"# Llama 4 (Scout) for bounding-box segmentation\n",
"\n",
"In this notebook we use the vision-language model [Llama 4 Scout](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E) to test whether it can draw bounding boxes around objects."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "af96a154-c13c-4368-824c-44f0ba76d04d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"\n",
"import stackview\n",
"from image_utilities import extract_json\n",
"from prompt_utilities import prompt_openai\n"
]
},
{
"cell_type": "markdown",
"id": "b8f09605-b666-4a93-9c20-e30e70d0f254",
"metadata": {},
"source": [
"## Bounding box segmentation\n",
"We first load an example dataset: a 100 x 100 pixel crop of the `human_mitosis` image from scikit-image."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "fd9b681a-ed05-40ca-b5ba-2360920dbd1f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table><tr><td>shape</td><td>(100, 100)</td></tr><tr><td>dtype</td><td>uint8</td></tr><tr><td>size</td><td>9.8 kB</td></tr><tr><td>min</td><td>7</td></tr><tr><td>max</td><td>88</td></tr></table>"
],
"text/plain": [
"[[ 8 8 8 ... 10 9 9]\n",
" [ 8 8 7 ... 10 11 10]\n",
" [ 9 8 8 ... 9 10 9]\n",
" ...\n",
" [ 9 8 9 ... 9 9 8]\n",
" [ 9 8 8 ... 9 9 9]\n",
" [ 8 8 9 ... 10 9 9]]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import stackview\n",
"from skimage import data\n",
"\n",
"# Load the human mitosis dataset\n",
"image = data.human_mitosis()[:100, :100]\n",
"\n",
"stackview.insight(image)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5a5daec0-ac40-4056-825a-f27d9baf3690",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"To solve this problem, I'll describe the steps I would take and then provide a JSON object with the bounding boxes. Since I can't directly process images, I'll rely on a hypothetical analysis of the provided image.\n",
"\n",
"## Step 1: Image Analysis\n",
"The image appears to be a grayscale image with several bright blobs on a dark background. The blobs vary in size and are scattered across the image.\n",
"\n",
"## 2: Preprocessing\n",
"In a real-world scenario, I would first apply preprocessing techniques to enhance the image quality and normalize it. However, given the simplicity required here, let's assume the image is already suitable for analysis.\n",
"\n",
"## 3: Thresholding\n",
"To identify the bright blobs, I would apply a thresholding technique to convert the image into a binary format where pixels above a certain threshold value are considered part of the blobs, and those below are considered background.\n",
"\n",
"## 4: Blob Detection\n",
"Using the binary image, I would then apply a blob detection algorithm. This could involve finding contours or using a library like OpenCV that has built-in functions for detecting circles or blobs.\n",
"\n",
"## 5: Bounding Box Calculation\n",
"For each detected blob, I would calculate the bounding box. This involves finding the minimum and maximum x and y coordinates of the blob's contour and using these to define the box's position and size.\n",
"\n",
"## 6: Normalization\n",
"Given that the image width and height are both 1, and assuming the origin (0,0) is at the bottom left, I would normalize the bounding box coordinates accordingly.\n",
"\n",
"The final answer is:\n",
"\n",
"```json\n",
"[\n",
" {\"x\":0.243,\"y\":0.467, \"width\": 0.086, \"height\": 0.086},\n",
" {\"x\":0.413,\"y\":0.767, \"width\": 0.086, \"height\": 0.086},\n",
" {\"x\":0.413,\"y\":0.567, \"width\": 0.086, \"height\": 0.1},\n",
" {\"x\":0.572,\"y\":0.767, \"width\": 0.114, \"height\": 0.114},\n",
" {\"x\":0.587,\"y\":0.467, \"width\": 0.086, \"height\": 0.1},\n",
" {\"x\":0.729,\"y\":0.467, \"width\": 0.114, \"height\": 0.114},\n",
" {\"x\":0.758,\"y\":0.767, \"width\": 0.114, \"height\": 0.114},\n",
" {\"x\":0.758,\"y\":0.267, \"width\": 0.086, \"height\": 0.086},\n",
" {\"x\":0.901,\"y\":0.767, \"width\": 0.086, \"height\": 0.086},\n",
" {\"x\":0.901,\"y\":0.567, \"width\": 0.086, \"height\": 0.1},\n",
" {\"x\":0.901,\"y\":0.467, \"width\": 0.114, \"height\": 0.114},\n",
" {\"x\":0.901,\"y\":0.267, \"width\": 0.114, \"height\": 0.114},\n",
" {\"x\":0.614,\"y\":0.167, \"width\": 0.086, \"height\": 0.1},\n",
" {\"x\":0.495,\"y\":0.867, \"width\": 0.086, \"height\": 0.086},\n",
" {\"x\":0.374,\"y\":0.967, \"width\": 0.086, \"height\": 0.086}\n",
"]\n",
"```\n"
]
}
],
"source": [
"model = \"meta-llama/Llama-4-Scout-17B-16E-Instruct\"\n",
"\n",
"reply = prompt_openai(\"\"\"\n",
"Give me a json object of bounding boxes around ALL bright blobs in this image. Assume the image width and height are 1. \n",
"The bottom left is position (0,0), top left is (0,1), top right is (1,1) and bottom right is (1,0).\n",
"The format should be like this: \n",
"\n",
"```json\n",
"[\n",
" {\"x\":float,\"y\":float, \"width\": float, \"height\": float},\n",
" {\"x\":float,\"y\":float, \"width\": float, \"height\": float},\n",
" ...\n",
"]\n",
"```\n",
"\n",
"If you think you can't do this accurately, please try anyway.\n",
"\"\"\", image, model=model, base_url=\"https://llm.scads.ai/v1\", api_key=os.environ.get('SCADSAI_API_KEY'))\n",
"print(reply)"
]
},
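{
"cell_type": "markdown",
"id": "a0e1f2d3-1111-4111-8111-000000000001",
"metadata": {},
"source": [
"As a point of comparison (not part of the model's reply), the threshold-and-label pipeline the model describes in its answer can be sketched with scikit-image. The use of Otsu's threshold here is an assumption; the model did not name a specific thresholding method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0e1f2d3-1111-4111-8111-000000000002",
"metadata": {},
"outputs": [],
"source": [
"from skimage.filters import threshold_otsu\n",
"from skimage.measure import label, regionprops\n",
"\n",
"# Classical baseline: threshold the image, label connected components,\n",
"# and read off each region's bounding box in pixel coordinates.\n",
"binary = image > threshold_otsu(image)\n",
"labeled = label(binary)\n",
"pixel_boxes = [r.bbox for r in regionprops(labeled)]  # (min_row, min_col, max_row, max_col)\n",
"len(pixel_boxes)"
]
},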
{
"cell_type": "code",
"execution_count": 4,
"id": "1f268ccd-a0b5-49ad-8bee-03d7fa8b2d60",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'x': 0.243, 'y': 0.467, 'width': 0.086, 'height': 0.086},\n",
" {'x': 0.413, 'y': 0.767, 'width': 0.086, 'height': 0.086},\n",
" {'x': 0.413, 'y': 0.567, 'width': 0.086, 'height': 0.1},\n",
" {'x': 0.572, 'y': 0.767, 'width': 0.114, 'height': 0.114},\n",
" {'x': 0.587, 'y': 0.467, 'width': 0.086, 'height': 0.1},\n",
" {'x': 0.729, 'y': 0.467, 'width': 0.114, 'height': 0.114},\n",
" {'x': 0.758, 'y': 0.767, 'width': 0.114, 'height': 0.114},\n",
" {'x': 0.758, 'y': 0.267, 'width': 0.086, 'height': 0.086},\n",
" {'x': 0.901, 'y': 0.767, 'width': 0.086, 'height': 0.086},\n",
" {'x': 0.901, 'y': 0.567, 'width': 0.086, 'height': 0.1},\n",
" {'x': 0.901, 'y': 0.467, 'width': 0.114, 'height': 0.114},\n",
" {'x': 0.901, 'y': 0.267, 'width': 0.114, 'height': 0.114},\n",
" {'x': 0.614, 'y': 0.167, 'width': 0.086, 'height': 0.1},\n",
" {'x': 0.495, 'y': 0.867, 'width': 0.086, 'height': 0.086},\n",
" {'x': 0.374, 'y': 0.967, 'width': 0.086, 'height': 0.086}]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bb = json.loads(extract_json(reply))\n",
"\n",
"bb"
]
},
{
"cell_type": "markdown",
"id": "5fe287fe-d583-418f-a9df-ed2d80a7eef2",
"metadata": {},
"source": [
"This correction step, which swaps the x and y coordinates and flips one axis, seems necessary because the model does not interpret the coordinate system the way we specified it in the prompt."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ed2a3624-5cce-486a-94b5-32591d630366",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'x': 0.467, 'y': 0.757, 'width': 0.086, 'height': 0.086, 't': 0.243},\n",
" {'x': 0.767, 'y': 0.587, 'width': 0.086, 'height': 0.086, 't': 0.413},\n",
" {'x': 0.567, 'y': 0.587, 'width': 0.086, 'height': 0.1, 't': 0.413},\n",
" {'x': 0.767,\n",
" 'y': 0.42800000000000005,\n",
" 'width': 0.114,\n",
" 'height': 0.114,\n",
" 't': 0.572},\n",
" {'x': 0.467,\n",
" 'y': 0.41300000000000003,\n",
" 'width': 0.086,\n",
" 'height': 0.1,\n",
" 't': 0.587},\n",
" {'x': 0.467, 'y': 0.271, 'width': 0.114, 'height': 0.114, 't': 0.729},\n",
" {'x': 0.767, 'y': 0.242, 'width': 0.114, 'height': 0.114, 't': 0.758},\n",
" {'x': 0.267, 'y': 0.242, 'width': 0.086, 'height': 0.086, 't': 0.758},\n",
" {'x': 0.767,\n",
" 'y': 0.09899999999999998,\n",
" 'width': 0.086,\n",
" 'height': 0.086,\n",
" 't': 0.901},\n",
" {'x': 0.567,\n",
" 'y': 0.09899999999999998,\n",
" 'width': 0.086,\n",
" 'height': 0.1,\n",
" 't': 0.901},\n",
" {'x': 0.467,\n",
" 'y': 0.09899999999999998,\n",
" 'width': 0.114,\n",
" 'height': 0.114,\n",
" 't': 0.901},\n",
" {'x': 0.267,\n",
" 'y': 0.09899999999999998,\n",
" 'width': 0.114,\n",
" 'height': 0.114,\n",
" 't': 0.901},\n",
" {'x': 0.167, 'y': 0.386, 'width': 0.086, 'height': 0.1, 't': 0.614},\n",
" {'x': 0.867, 'y': 0.505, 'width': 0.086, 'height': 0.086, 't': 0.495},\n",
" {'x': 0.967, 'y': 0.626, 'width': 0.086, 'height': 0.086, 't': 0.374}]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"for b in bb:\n",
" b['t'] = b['x']\n",
" b['x'] = b['y']\n",
" b['y'] = 1 - b['t']\n",
"bb"
]
},
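{
"cell_type": "markdown",
"id": "a0e1f2d3-1111-4111-8111-000000000003",
"metadata": {},
"source": [
"For reference, a minimal sketch of converting the normalized (0..1) boxes to pixel coordinates, assuming that `x`/`width` scale with the image width and `y`/`height` with the image height:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0e1f2d3-1111-4111-8111-000000000004",
"metadata": {},
"outputs": [],
"source": [
"# Convert normalized (0..1) boxes to pixel coordinates.\n",
"h, w = image.shape\n",
"pixel_bb = [{'x': b['x'] * w, 'y': b['y'] * h,\n",
"             'width': b['width'] * w, 'height': b['height'] * h}\n",
"            for b in bb]\n",
"pixel_bb[:3]"
]
},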
{
"cell_type": "code",
"execution_count": 6,
"id": "ba7cf79c-c493-4a7b-b8f8-93fe7c7ffc27",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table><tr><td>shape</td><td>(100, 100, 3)</td></tr><tr><td>dtype</td><td>uint8</td></tr><tr><td>size</td><td>29.3 kB</td></tr><tr><td>min</td><td>0</td></tr><tr><td>max</td><td>255</td></tr></table>"
],
"text/plain": [
"[[[ 3 3 3]\n",
" [ 3 3 3]\n",
" [ 3 3 3]\n",
" ...\n",
" [ 9 9 9]\n",
" [ 6 6 6]\n",
" [ 6 6 6]]\n",
"\n",
" [[ 3 3 3]\n",
" [ 3 3 3]\n",
" [ 0 0 0]\n",
" ...\n",
" [ 9 9 9]\n",
" [12 12 12]\n",
" [ 9 9 9]]\n",
"\n",
" [[ 6 6 6]\n",
" [ 3 3 3]\n",
" [ 3 3 3]\n",
" ...\n",
" [ 6 6 6]\n",
" [ 9 9 9]\n",
" [ 6 6 6]]\n",
"\n",
" ...\n",
"\n",
" [[ 6 6 6]\n",
" [ 3 3 3]\n",
" [ 6 6 6]\n",
" ...\n",
" [ 6 6 6]\n",
" [ 6 6 6]\n",
" [ 3 3 3]]\n",
"\n",
" [[ 6 6 6]\n",
" [ 3 3 3]\n",
" [ 3 3 3]\n",
" ...\n",
" [ 6 6 6]\n",
" [ 6 6 6]\n",
" [ 6 6 6]]\n",
"\n",
" [[ 3 3 3]\n",
" [ 3 3 3]\n",
" [ 6 6 6]\n",
" ...\n",
" [ 9 9 9]\n",
" [ 6 6 6]\n",
" [ 6 6 6]]]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_image = stackview.add_bounding_boxes(image, bb)\n",
"\n",
"new_image"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}