CLIP transformer#

The CLIP transformer was originally developed to classify images into classes that are unknown at training time. A pre-trained model can therefore be reused for many different classification tasks, simply by defining different classes at inference time.

This notebook is modified from example code here.

from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

We first download a pre-trained model and its corresponding processor.

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("data/real_cat.png")
image
(The cat image loaded from data/real_cat.png is displayed.)
options = ["a photo of a cat", 
           "a photo of a dog", 
           "a photo of a microscope"]
#options = ["a photo of a cat", 
#           "a photo of a dog"]
inputs = processor(text=options, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
probs
tensor([[0.1353, 0.0013, 0.8634]], grad_fn=<SoftmaxBackward0>)
label_probabilities = {k: v for k, v in zip(options, probs[0].tolist())}
label_probabilities
{'a photo of a cat': 0.13529185950756073,
 'a photo of a dog': 0.0012658964842557907,
 'a photo of a microscope': 0.8634422421455383}
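
If a single predicted class is needed rather than the full probability distribution, one can take the label with the highest probability. A minimal sketch, reusing the options and probs variables from above:

predicted_index = probs.argmax(dim=1).item()  # index of the most likely label
predicted_label = options[predicted_index]
predicted_label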