{
"cells": [
{
"cell_type": "markdown",
"id": "1c7a3958-a52b-4446-8506-94f0c98863f6",
"metadata": {},
"source": [
"# Benchmarking spot counting using Vision Language Models\n",
"\n",
"Some vision language models, such as [moondream](https://huggingface.co/vikhyatk/moondream2), can count objects in images by pointing at them: the model generates a list of point coordinates marking where the prompted objects are located. Although the model was trained on natural images, it can also be used to count bright blobs in dark images, which suggests that such models might be useful for microscopy image analysis. In this notebook we benchmark how well it performs, first on a real microscopy image of nuclei and then on synthetic images that look similar."
]
},
{
"cell_type": "markdown",
"id": "38b84c3f-7a1d-4324-af3b-bb3be13f119f",
"metadata": {},
"source": [
"Installation (Windows):\n",
"* Download vips-dev-w64-all-8.16.1.zip from [here](https://github.com/libvips/build-win64-mxe/releases/tag/v8.16.1), unzip it, and add its subfolder `bin` to the PATH environment variable.\n",
"* `pip install einops pyvips`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ab4408ba-b33f-46d0-89cb-59133378f1ae",
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoModelForCausalLM, AutoTokenizer\n",
"from PIL import Image\n",
"from image_utilities import numpy_to_bytestream, extract_json, generate_spots\n",
"from tqdm import tqdm\n",
"import stackview\n",
"\n",
"# Load moondream2 at a pinned revision; trust_remote_code is needed because the model ships its own code\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
"    \"vikhyatk/moondream2\",\n",
"    revision=\"2025-01-09\",\n",
"    trust_remote_code=True,\n",
"    # Comment out the next line to run on CPU. Running on the GPU requires about 5 GB of GPU memory.\n",
"    device_map={\"\": \"cuda\"}\n",
")"
]
},
{
"cell_type": "markdown",
"id": "18c7adae-d56b-452b-bba9-aed363bc831d",
"metadata": {},
"source": [
"## Human mitosis\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "22263f34-6be6-4a8f-9bbb-a5e138dacee4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<tr><td>shape</td><td>(100, 100)</td></tr>\n",
"<tr><td>dtype</td><td>uint8</td></tr>\n",
"<tr><td>size</td><td>9.8 kB</td></tr>\n",
"<tr><td>min</td><td>7</td><td>max</td><td>88</td></tr>\n",
"</table>"
],
"text/plain": [
"StackViewNDArray([[ 8, 8, 8, ..., 10, 9, 9],\n",
" [ 8, 8, 7, ..., 10, 11, 10],\n",
" [ 9, 8, 8, ..., 9, 10, 9],\n",
" ...,\n",
" [ 9, 8, 9, ..., 9, 9, 8],\n",
" [ 9, 8, 8, ..., 9, 9, 9],\n",
" [ 8, 8, 9, ..., 10, 9, 9]], dtype=uint8)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import stackview\n",
"from skimage import data\n",
"import numpy as np\n",
"\n",
"# Load the human mitosis example image and crop the top-left 100x100 pixels\n",
"image = data.human_mitosis()[:100, :100]\n",
"\n",
"# Display the image\n",
"stackview.insight(image)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2479e515-c803-4577-a9b0-ef2a2b54a8d2",
"metadata": {},
"outputs": [],
"source": [
"pil_image = Image.fromarray(image)\n",
"\n",
"encoded_image = model.encode_image(pil_image)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "dee1257a-b0f0-41b9-933f-cc40af031b47",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Pointing: 'bright spot'\n",
"Found 14 bright spot(s)\n"
]
}
],
"source": [
"# Ask the model to point at the bright spots; it returns a list of points with relative x and y coordinates\n",
"print(\"\\nPointing: 'bright spot'\")\n",
"points = model.point(encoded_image, \"Mark the bright dots\")[\"points\"]\n",
"print(f\"Found {len(points)} bright spot(s)\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2d3e02ac-fc6e-493b-a6b5-c36654a0cc93",
"metadata": {},
"outputs": [],
"source": [
"\n",
"box_half_size = 5 / image.shape[0]\n",
"bb = [{'x':p['x']-box_half_size, 'y':p['y']-box_half_size, 'width':2*box_half_size, 'height':2*box_half_size } for p in points]"
]
},
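{
"cell_type": "markdown",
"id": "sketch-points-to-pixels",
"metadata": {},
"source": [
"The bounding boxes above are computed in relative coordinates, because the points returned by `model.point` are given relative to the image size. To compare them with pixel-based measurements, they can be scaled by the image height and width. A minimal sketch, assuming the coordinates are normalized to the range 0 to 1; the helper `points_to_pixels` and the example points are hypothetical and not part of moondream or stackview:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def points_to_pixels(points, image_shape):\n",
"    '''Scale normalized points (dicts with x and y in 0..1) to (row, column) pixel coordinates.'''\n",
"    height, width = image_shape[:2]\n",
"    scaled = [[p['y'] * height, p['x'] * width] for p in points]\n",
"    # reshape keeps a (0, 2) array when no points were found\n",
"    return np.array(scaled, dtype=float).reshape(-1, 2)\n",
"\n",
"# Hypothetical example points, not actual model output\n",
"example_points = [{'x': 0.25, 'y': 0.5}, {'x': 0.9, 'y': 0.1}]\n",
"pixel_coords = points_to_pixels(example_points, (100, 100))\n",
"print(pixel_coords)\n",
"```\n",
"\n",
"In this notebook one would call `points_to_pixels(points, image.shape)` on the output of the pointing cell."
]
},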
{
"cell_type": "code",
"execution_count": 6,
"id": "602ecc12-4c6f-4b36-8542-a04ba10df765",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<tr><td>shape</td><td>(100, 100, 3)</td></tr>\n",
"<tr><td>dtype</td><td>uint8</td></tr>\n",
"<tr><td>size</td><td>29.3 kB</td></tr>\n",
"<tr><td>min</td><td>0</td><td>max</td><td>255</td></tr>\n",
"</table>"
],
"text/plain": [
"StackViewNDArray([[[ 3, 3, 3],\n",
" [ 3, 3, 3],\n",
" [ 3, 3, 3],\n",
" ...,\n",
" [ 9, 9, 9],\n",
" [ 6, 6, 6],\n",
" [ 6, 6, 6]],\n",
"\n",
" [[ 3, 3, 3],\n",
" [ 3, 3, 3],\n",
" [ 0, 0, 0],\n",
" ...,\n",
" [ 9, 9, 9],\n",
" [12, 12, 12],\n",
" [ 9, 9, 9]],\n",
"\n",
" [[ 6, 6, 6],\n",
" [ 3, 3, 3],\n",
" [ 3, 3, 3],\n",
" ...,\n",
" [ 6, 6, 6],\n",
" [ 9, 9, 9],\n",
" [ 6, 6, 6]],\n",
"\n",
" ...,\n",
"\n",
" [[ 6, 6, 6],\n",
" [ 3, 3, 3],\n",
" [ 6, 6, 6],\n",
" ...,\n",
" [ 6, 6, 6],\n",
" [ 6, 6, 6],\n",
" [ 3, 3, 3]],\n",
"\n",
" [[ 6, 6, 6],\n",
" [ 3, 3, 3],\n",
" [ 3, 3, 3],\n",
" ...,\n",
" [ 6, 6, 6],\n",
" [ 6, 6, 6],\n",
" [ 6, 6, 6]],\n",
"\n",
" [[ 3, 3, 3],\n",
" [ 3, 3, 3],\n",
" [ 6, 6, 6],\n",
" ...,\n",
" [ 9, 9, 9],\n",
" [ 6, 6, 6],\n",
" [ 6, 6, 6]]], dtype=uint8)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stackview.add_bounding_boxes(image, bb)"
]
},
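{
"cell_type": "markdown",
"id": "sketch-generate-spots",
"metadata": {},
"source": [
"The benchmark in the next cell uses `generate_spots` from the local module `image_utilities`, whose implementation is not shown in this notebook. The following is only a sketch of what such a helper might do, assuming it places `n` spots at random positions in a dark image and renders each as a Gaussian blob of width `sigma`. The function name `generate_spots_sketch`, the image size, the `seed` parameter and the intensity scaling are assumptions, not the actual implementation:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def generate_spots_sketch(n, sigma=5, shape=(256, 256), seed=None):\n",
"    '''Return (coords, image): n random spot centers and a uint8 image of Gaussian spots.'''\n",
"    rng = np.random.default_rng(seed)\n",
"    # Random (row, column) centers; spots may overlap by chance\n",
"    coords = rng.uniform(0, 1, size=(n, 2)) * np.array(shape)\n",
"    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]\n",
"    image = np.zeros(shape, dtype=float)\n",
"    for r, c in coords:\n",
"        # Add one Gaussian blob per spot center\n",
"        image += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))\n",
"    if image.max() > 0:\n",
"        # Rescale so that the brightest pixel is 255\n",
"        image = image / image.max() * 255\n",
"    return coords, image.astype(np.uint8)\n",
"\n",
"coords, image = generate_spots_sketch(n=10, sigma=5, seed=0)\n",
"print(coords.shape, image.shape, image.dtype)\n",
"```\n",
"\n",
"Because overlapping blobs merge into one bright region, the number of visually separable spots can be lower than `n` for crowded images, which is worth keeping in mind when reading the benchmark results."
]
},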
{
"cell_type": "code",
"execution_count": 7,
"id": "0f6ac651-0bb3-4a32-9f06-a1dad92018ba",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████████████████████████████████████████████████████████████████████████████| 71/71 [38:58<00:00, 32.93s/it]\n"
]
}
],
"source": [
"counts_mean = []\n",
"counts_std = []\n",
"counts_gt = []\n",
"images = []\n",
"\n",
"for n in tqdm(range(1, 72, 1)):\n",
"    counts_gt.append(n)\n",
"\n",
"    coords, image = generate_spots(n=n, sigma=5)\n",
"    pil_image = Image.fromarray(image)\n",
"    encoded_image = model.encode_image(pil_image)\n",
"\n",
"    # Run model.point 10 times and collect results\n",
"    run_counts = []\n",
"    first_run_points = None\n",
"\n",
"    for _ in range(10):\n",
"        points = model.point(encoded_image, \"Mark the bright dots\")[\"points\"]\n",
"        run_counts.append(len(points))\n",
"\n",
"        # Store points from first run for visualization\n",
"        if first_run_points is None:\n",
"            first_run_points = points\n",
"\n",
"    # Calculate mean and standard deviation\n",
"    counts_mean.append(np.mean(run_counts))\n",
"    counts_std.append(np.std(run_counts))\n",
"\n",
"    # Use points from first run for visualization\n",
"    box_half_size = 5 / image.shape[0]\n",
"    bb = [{'x':p['x']-box_half_size, 'y':p['y']-box_half_size, 'width':2*box_half_size, 'height':2*box_half_size } for p in first_run_points]\n",
"    images.append(stackview.add_bounding_boxes(image, bb))"
]
},
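{
"cell_type": "markdown",
"id": "sketch-count-errors",
"metadata": {},
"source": [
"The benchmark can also be summarized in a few numbers by comparing the mean counts with the ground truth, for example using the mean absolute error and the mean signed error (bias). A minimal sketch; the function `count_errors` and the example values are hypothetical, and in this notebook one would pass `counts_gt` and `counts_mean` instead:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def count_errors(ground_truth, predicted):\n",
"    '''Return mean absolute error and mean signed error (bias) of predicted counts.'''\n",
"    diff = np.asarray(predicted, dtype=float) - np.asarray(ground_truth, dtype=float)\n",
"    return {'mae': float(np.mean(np.abs(diff))), 'bias': float(np.mean(diff))}\n",
"\n",
"# Hypothetical example values, not results from this benchmark\n",
"errors = count_errors([1, 2, 3, 4], [1.0, 2.0, 2.5, 3.0])\n",
"print(errors)\n",
"```\n",
"\n",
"A negative bias means the model tends to undercount, a positive bias that it tends to overcount."
]
},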
{
"cell_type": "code",
"execution_count": 14,
"id": "d0e2d8b6-890c-4353-a83f-df677b76a53e",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Create a figure\n",
"plt.figure(figsize=(6, 4))\n",
"\n",
"# Create x values for plotting (use counts_gt)\n",
"x = np.array(counts_gt)\n",
"\n",
"# Plot counts_mean vs counts_gt with error bars from counts_std\n",
"plt.errorbar(x, counts_mean, yerr=counts_std, fmt='o', capsize=5, \n",
" markersize=6, label='Detected Points (mean ± std)')\n",
"\n",
"# Add diagonal line (y=x) representing perfect detection\n",
"max_val = max(max(x), max(counts_mean) + max(counts_std))\n",
"plt.plot([0, max_val], [0, max_val], 'k--', label='Perfect Detection')\n",
"\n",
"# Add labels and title\n",
"plt.xlabel('Ground Truth (Number of Points)', fontsize=12)\n",
"plt.ylabel('Detected Points', fontsize=12)\n",
"plt.title('Point Detection Performance with Uncertainty', fontsize=14)\n",
"\n",
"# Add grid and legend\n",
"plt.grid(True, alpha=0.3)\n",
"plt.legend(fontsize=12)\n",
"\n",
"# Set equal aspect ratio to make the plot square\n",
"plt.axis('equal')\n",
"plt.tight_layout()\n",
"\n",
"# Show the plot\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "dc83ffb7-3123-4070-92d2-6cdb0fad44e1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" "
],
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stackview.animate(images, frame_delay_ms=500, filename=\"moondream_detecting_spots.gif\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20fbdf05-221e-4ec0-8da3-0e6fba6e9451",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}