{ "cells": [ { "cell_type": "markdown", "id": "1c7a3958-a52b-4446-8506-94f0c98863f6", "metadata": {}, "source": [ "# Moondream VLM\n", "\n", "Vision-language models (VLMs) such as [moondream](https://huggingface.co/vikhyatk/moondream2) can locate objects in images, which allows us to draw bounding boxes around them." ] }, { "cell_type": "markdown", "id": "38b84c3f-7a1d-4324-af3b-bb3be13f119f", "metadata": {}, "source": [ "Installation on Windows takes a few extra steps:\n", "* Download `vips-dev-w64-all-8.16.1.zip` from the [libvips Windows build releases](https://github.com/libvips/build-win64-mxe/releases/tag/v8.16.1), unzip it, and add its `bin` subfolder to the `PATH` environment variable.\n", "* `pip install einops pyvips`" ] }, { "cell_type": "code", "execution_count": 6, "id": "ab4408ba-b33f-46d0-89cb-59133378f1ae", "metadata": {}, "outputs": [], "source": [ "from transformers import AutoModelForCausalLM, AutoTokenizer\n", "from PIL import Image\n", "from image_utilities import numpy_to_bytestream, extract_json\n", "from tqdm import tqdm\n", "import stackview\n", "from skimage.io import imread\n", "import numpy as np\n", "\n", "\n", "model = AutoModelForCausalLM.from_pretrained(\n", "    \"vikhyatk/moondream2\",\n", "    revision=\"2025-01-09\",\n", "    trust_remote_code=True,\n", "    # Comment out the next line to run on CPU. Running on the GPU requires about 5 GB of GPU memory.\n", "    device_map={\"\": \"cuda\"}\n", ")" ] }, { "cell_type": "markdown", "id": "18c7adae-d56b-452b-bba9-aed363bc831d", "metadata": {}, "source": [ "## Example data\n", "We load an example RGB image first." ] }, { "cell_type": "code", "execution_count": 7, "id": "22263f34-6be6-4a8f-9bbb-a5e138dacee4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
<table>\n",
"<tr><td>shape</td><td>(512, 512, 3)</td></tr>\n",
"<tr><td>dtype</td><td>uint8</td></tr>\n",
"<tr><td>size</td><td>768.0 kB</td></tr>\n",
"<tr><td>min</td><td>0</td></tr>\n",
"<tr><td>max</td><td>255</td></tr>\n",
"</table>
" ], "text/plain": [ "StackViewNDArray([[[176, 178, 179],\n", " [175, 178, 178],\n", " [177, 177, 180],\n", " ...,\n", " [182, 186, 188],\n", " [185, 188, 191],\n", " [191, 194, 197]],\n", "\n", " [[178, 180, 181],\n", " [178, 179, 181],\n", " [178, 180, 181],\n", " ...,\n", " [185, 189, 192],\n", " [187, 191, 192],\n", " [191, 195, 198]],\n", "\n", " [[181, 183, 185],\n", " [180, 182, 183],\n", " [180, 181, 183],\n", " ...,\n", " [190, 193, 196],\n", " [189, 193, 196],\n", " [192, 195, 198]],\n", "\n", " ...,\n", "\n", " [[125, 91, 66],\n", " [124, 90, 65],\n", " [123, 89, 65],\n", " ...,\n", " [137, 92, 64],\n", " [136, 91, 62],\n", " [135, 89, 61]],\n", "\n", " [[122, 88, 64],\n", " [121, 87, 63],\n", " [121, 87, 63],\n", " ...,\n", " [142, 96, 68],\n", " [142, 96, 68],\n", " [139, 94, 65]],\n", "\n", " [[120, 86, 62],\n", " [120, 86, 60],\n", " [119, 85, 61],\n", " ...,\n", " [144, 99, 70],\n", " [144, 99, 70],\n", " [142, 97, 68]]], dtype=uint8)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "image = imread(\"data/real_cat.png\")\n", "\n", "# Display the image together with its shape, dtype and intensity range\n", "stackview.insight(image)" ] }, { "cell_type": "markdown", "id": "c9921502-874b-472f-a43f-57898edea006", "metadata": {}, "source": [ "This image needs to be converted to a Pillow image before we can encode it." ] }, { "cell_type": "code", "execution_count": 8, "id": "2479e515-c803-4577-a9b0-ef2a2b54a8d2", "metadata": {}, "outputs": [], "source": [ "pil_image = Image.fromarray(image)\n", "\n", "encoded_image = model.encode_image(pil_image)" ] }, { "cell_type": "markdown", "id": "479c5dac-aa10-4924-bbdb-fe0d05e5f1d4", "metadata": {}, "source": [ "## Pointing\n", "We can then ask the model for the coordinates of given objects in the image." 
] }, { "cell_type": "code", "execution_count": 9, "id": "dee1257a-b0f0-41b9-933f-cc40af031b47", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 1 point(s)\n" ] }, { "data": { "text/plain": [ "[{'x': 0.6943359375, 'y': 0.427734375}]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prompt = \"Mark the animals\"\n", "\n", "points = model.point(encoded_image, prompt)[\"points\"]\n", "\n", "print(f\"Found {len(points)} point(s)\")\n", "points" ] }, { "cell_type": "code", "execution_count": 10, "id": "602ecc12-4c6f-4b36-8542-a04ba10df765", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
<table>\n",
"<tr><td>shape</td><td>(512, 512, 3)</td></tr>\n",
"<tr><td>dtype</td><td>uint8</td></tr>\n",
"<tr><td>size</td><td>768.0 kB</td></tr>\n",
"<tr><td>min</td><td>0</td></tr>\n",
"<tr><td>max</td><td>255</td></tr>\n",
"</table>
" ], "text/plain": [ "StackViewNDArray([[[176, 178, 179],\n", " [175, 178, 178],\n", " [177, 177, 180],\n", " ...,\n", " [182, 186, 188],\n", " [185, 188, 191],\n", " [191, 194, 197]],\n", "\n", " [[178, 180, 181],\n", " [178, 179, 181],\n", " [178, 180, 181],\n", " ...,\n", " [185, 189, 192],\n", " [187, 191, 192],\n", " [191, 195, 198]],\n", "\n", " [[181, 183, 185],\n", " [180, 182, 183],\n", " [180, 181, 183],\n", " ...,\n", " [190, 193, 196],\n", " [189, 193, 196],\n", " [192, 195, 198]],\n", "\n", " ...,\n", "\n", " [[125, 91, 66],\n", " [124, 90, 65],\n", " [123, 89, 65],\n", " ...,\n", " [137, 92, 64],\n", " [136, 91, 62],\n", " [135, 89, 61]],\n", "\n", " [[122, 88, 64],\n", " [121, 87, 63],\n", " [121, 87, 63],\n", " ...,\n", " [142, 96, 68],\n", " [142, 96, 68],\n", " [139, 94, 65]],\n", "\n", " [[120, 86, 62],\n", " [120, 86, 60],\n", " [119, 85, 61],\n", " ...,\n", " [144, 99, 70],\n", " [144, 99, 70],\n", " [142, 97, 68]]], dtype=uint8)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stackview.add_bounding_boxes(image, points)" ] }, { "cell_type": "markdown", "id": "1f6c548f-5c8a-43c5-a3db-0202939933e8", "metadata": {}, "source": [ "## Bounding boxes\n", "For visualization, a bounding box surrounding the object of interest may be more useful than a single point." ] }, { "cell_type": "code", "execution_count": 12, "id": "570083e0-b086-4736-9dd3-5a4fd483b610", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'x_min': 0.39208984375,\n", " 'y_min': 0.041015625,\n", " 'x_max': 0.91650390625,\n", " 'y_max': 0.796875}]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bb = model.detect(encoded_image, prompt)[\"objects\"]\n", "bb" ] }, { "cell_type": "code", "execution_count": 13, "id": "aa46ecc5-1a1e-49b3-afb9-d8fb147cf83c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
<table>\n",
"<tr><td>shape</td><td>(512, 512, 3)</td></tr>\n",
"<tr><td>dtype</td><td>uint8</td></tr>\n",
"<tr><td>size</td><td>768.0 kB</td></tr>\n",
"<tr><td>min</td><td>0</td></tr>\n",
"<tr><td>max</td><td>255</td></tr>\n",
"</table>
" ], "text/plain": [ "StackViewNDArray([[[176, 178, 179],\n", " [175, 178, 178],\n", " [177, 177, 180],\n", " ...,\n", " [182, 186, 188],\n", " [185, 188, 191],\n", " [191, 194, 197]],\n", "\n", " [[178, 180, 181],\n", " [178, 179, 181],\n", " [178, 180, 181],\n", " ...,\n", " [185, 189, 192],\n", " [187, 191, 192],\n", " [191, 195, 198]],\n", "\n", " [[181, 183, 185],\n", " [180, 182, 183],\n", " [180, 181, 183],\n", " ...,\n", " [190, 193, 196],\n", " [189, 193, 196],\n", " [192, 195, 198]],\n", "\n", " ...,\n", "\n", " [[125, 91, 66],\n", " [124, 90, 65],\n", " [123, 89, 65],\n", " ...,\n", " [137, 92, 64],\n", " [136, 91, 62],\n", " [135, 89, 61]],\n", "\n", " [[122, 88, 64],\n", " [121, 87, 63],\n", " [121, 87, 63],\n", " ...,\n", " [142, 96, 68],\n", " [142, 96, 68],\n", " [139, 94, 65]],\n", "\n", " [[120, 86, 62],\n", " [120, 86, 60],\n", " [119, 85, 61],\n", " ...,\n", " [144, 99, 70],\n", " [144, 99, 70],\n", " [142, 97, 68]]], dtype=uint8)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert the min/max corners returned by the model\n", "# into x/y/width/height entries for stackview\n", "stackview_bb = [{\n", "    \"x\": b[\"x_min\"],\n", "    \"y\": b[\"y_min\"],\n", "    \"width\": b[\"x_max\"] - b[\"x_min\"],\n", "    \"height\": b[\"y_max\"] - b[\"y_min\"]\n", "    } for b in bb\n", "]\n", "\n", "image_with_bb = stackview.add_bounding_boxes(image, stackview_bb)\n", "\n", "image_with_bb" ] }, { "cell_type": "markdown", "id": "153f2c38-c265-4985-96cf-ba6b06842837", "metadata": {}, "source": [ "## Exercise\n", "Try different prompts for drawing bounding boxes on this image. Consider asking for specific devices, and also ask for bounding boxes around all objects in the image. What other objects are detected?" 
] }, { "cell_type": "code", "execution_count": null, "id": "d3e487d1-90f0-4149-9c5b-77bc89159ca2", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 5 }