from transformers import pipeline
import torch
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device="cuda",
    dtype=torch.bfloat16,
)

Using Gemma 3 (4B) to identify images

from IPython.display import Image
IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"
Image(url=IMAGE_URL, width=500)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": IMAGE_URL},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])

Okay, let's take a look!
Based on the image, the animal on the candy is a **turtle**. You can see the shell pattern clearly.
Would you like to know anything else about these candies?
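
To ask further questions about the same or another image, the chat structure used above can be wrapped in a small helper. This is only a sketch: `build_messages` and its parameter names are hypothetical and not part of transformers; it simply reproduces the system/user message layout shown earlier.

```python
def build_messages(image_url, question, system_prompt="You are a helpful assistant."):
    """Hypothetical helper: build a one-turn chat with one image and one question.

    Mirrors the message layout used above; it is not a transformers API.
    """
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        },
    ]


# Placeholder URL for illustration only.
example = build_messages("https://example.com/image.jpg", "What is in this image?")
print(example[1]["content"][1]["text"])
```

The result can be passed to the pipeline in the same way as before, for example `pipe(text=build_messages(IMAGE_URL, "What color is the candy?"), max_new_tokens=200)`.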