Using a VLM

In this notebook, we capture an image from the camera, send it to a VLM (Vision Language Model), and ask it “What do you see in this image?”

import cv
import graphics
import time

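# Create a canvas and show the live camera feed on it
canvas = graphics.canvas()
camera = cv.start_camera(canvas)

w = canvas.get_width()
h = canvas.get_height()
ctx = canvas.get_context('2d')

# Show a 3-2-1 countdown so you have time to get ready
for count in ["3", "2", "1"]:
  # Dim the canvas with a semi-transparent black overlay
  ctx.fill_style = "rgba(0, 0, 0, 0.5)"
  ctx.fill_rect(0, 0, w, h)
  # Draw the current number in the centre of the canvas
  ctx.font = "bold 200px sans-serif"
  ctx.fill_style = "white"
  ctx.text_align = "center"
  ctx.text_baseline = "middle"
  ctx.fill_text(count, w / 2, h / 2)
  time.sleep(1)
  canvas.clear()

# Grab the current frame as a data URL, then release the camera
data_url = cv.capture_frame(camera)
camera.stop()
# Print only the first 500 characters; the full string is very long
print(data_url[:500])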
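
The printed string is a data URL: a short header naming the image type, followed by the image bytes encoded as base64 text. The sketch below shows one way to inspect such a string using only the standard library. The helper name `describe_data_url` is hypothetical, and the sample uses a tiny stand-in payload rather than a real camera frame, so the MIME type and size you see for your own `data_url` will differ.

```python
import base64


def describe_data_url(data_url: str) -> dict:
    """Return the MIME type and decoded size of a base64 data URL.

    Hypothetical helper for illustration; not part of the cv module.
    """
    if not data_url.startswith("data:"):
        raise ValueError("not a data URL")
    # Everything before the first comma is the header, e.g. "data:image/png;base64"
    header, _, payload = data_url.partition(",")
    meta = header[len("data:"):].split(";")
    mime_type = meta[0] or "text/plain"
    if "base64" not in meta[1:]:
        raise ValueError("expected a base64-encoded data URL")
    raw = base64.b64decode(payload)
    return {"mime_type": mime_type, "num_bytes": len(raw)}


# Stand-in payload (the 8-byte PNG signature) so the sketch runs without a camera
sample = "data:image/png;base64," + base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
info = describe_data_url(sample)
print(info)
```

In the notebook you could call `describe_data_url(data_url)` on the captured frame instead of the sample.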

Now that we have the image captured, let’s give it to the model…

from openai import OpenAI
import os

# OpenRouter exposes an OpenAI-compatible API, so we point the OpenAI client at it
client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
  model="qwen/qwen3-vl-8b-instruct",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {"url": data_url},
        },
        {
          "type": "text",
          "text": "What do you see in this image?",
        },
      ],
    }
  ],
  stream=True
)

for chunk in response:
  # Some chunks may arrive without any choices; skip them
  if not chunk.choices:
    continue
  # Each chunk contains a delta with the new content
  token = chunk.choices[0].delta.content
  if token is not None:
    print(token, end='', flush=True)
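
Printing tokens as they arrive is good for watching the reply appear, but the text is gone once the loop ends. If you want to keep the full description, one option is to collect the pieces into a single string. The sketch below assumes chunks shaped like the ones in the loop above (`choices[0].delta.content`); `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks stand in for a real response so the example runs without an API key.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all
        if not chunk.choices:
            continue
        token = chunk.choices[0].delta.content
        # A delta with no new text has content set to None
        if token is not None:
            parts.append(token)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


fake_stream = [
    fake_chunk("A "),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("desk."),
]
description = collect_stream(fake_stream)
print(description)
```

A streamed response can only be read once, so in the notebook you would call `collect_stream(response)` in place of the printing loop, not after it.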

Now that the model has described your image, take a moment to reflect.

How accurate was the model’s description of your image? What details did it get right or miss? What other question would you want to ask the model about the image?
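
If you would like to try a follow-up question, one approach is to send the whole conversation again: the original image and question, the model's answer, and then your new question. The sketch below only builds the message list, using the same message shapes as the request earlier in the notebook. The helper name `build_followup_messages` and the sample answer and question are made up for illustration.

```python
def build_followup_messages(data_url, first_question, first_answer, followup):
    """Build a chat history that ends with a follow-up question about the image."""
    return [
        {
            # First turn: the image plus the original question
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": first_question},
            ],
        },
        # The model's earlier reply, so it has context for the follow-up
        {"role": "assistant", "content": first_answer},
        # The new question
        {"role": "user", "content": followup},
    ]


messages = build_followup_messages(
    "data:image/png;base64,AAAA",      # placeholder, use your captured data_url
    "What do you see in this image?",
    "A person sitting at a desk.",     # made-up answer for illustration
    "What colours stand out?",
)
print([m["role"] for m in messages])
```

You could then pass `messages` to `client.chat.completions.create(...)` with the same `model` as before.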

{ "question_type": "freeform", "question": "What does VLM stand for?", "answer": "Vision Language Model", "submitted_answer": "" }

{ "question_type": "multiple_choice", "question": "How is the captured image sent to the VLM in this notebook?", "options": [ { "key": "a", "text": "As a file path on the server" }, { "key": "b", "text": "As a base64-encoded data URL" }, { "key": "c", "text": "As a public web URL" }, { "key": "d", "text": "As a grid of pixel colour values" } ], "answer": "b", "submitted_answer": "" }

{ "question_type": "true_false", "question": "A Vision Language Model (VLM) can only process text; it cannot understand images.", "answer": "False", "submitted_answer": "" }