Using Gemma 3 (4B) to identify images

from transformers import pipeline
import torch

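# Load the instruction-tuned Gemma 3 4B checkpoint as an image-text-to-text pipeline.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device="cuda",  # requires a CUDA-capable GPU
    dtype=torch.bfloat16,  # 16-bit weights; older transformers releases call this torch_dtype
)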

from IPython.display import Image

IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"

# Preview the image inline in the notebook before querying the model.
Image(url=IMAGE_URL, width=500)

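# Chat-style input: each message has a role and a list of typed content parts.
# The user turn pairs the image (by URL) with the question about it.
messages = [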
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": IMAGE_URL},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]

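# Generate a reply of at most 200 new tokens.
output = pipe(text=messages, max_new_tokens=200)
# "generated_text" holds the whole conversation; the last entry is the model's reply.
print(output[0]["generated_text"][-1]["content"])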

Okay, let's take a look!

Based on the image, the animal on the candy is a **turtle**. You can see the shell pattern clearly. 

Would you like to know anything else about these candies?
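
To ask about other images or ask other questions, the message structure used above can be wrapped in a small helper. This is a minimal sketch: `build_messages` is a hypothetical name introduced here for illustration and is not part of transformers. It only assembles the same system and user turns shown earlier.

```python
from typing import Any

IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"


def build_messages(
    image_url: str,
    question: str,
    system_prompt: str = "You are a helpful assistant.",
) -> list[dict[str, Any]]:
    """Assemble a system turn plus a user turn holding one image and one question.

    Hypothetical helper: it mirrors the chat layout used in this notebook.
    """
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        },
    ]


messages = build_messages(IMAGE_URL, "What color is the candy?")

# With the pipeline from the first cell loaded, the call is unchanged:
# output = pipe(text=messages, max_new_tokens=200)
# print(output[0]["generated_text"][-1]["content"])
```

The pipeline call is left commented out so the helper can be checked without a GPU; uncomment it in a session where `pipe` is already loaded.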