VLMs as Assistive Technology

Vision Language Models (VLMs) can do more than answer questions: they can help people with visual impairments understand their surroundings. In this notebook, we build a simple seeing assistant that captures a photo from the camera every few seconds, sends it to a VLM, and reads the description aloud.


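from openai import OpenAI
import os
import audio

# OpenRouter exposes an OpenAI-compatible API, so the OpenAI client works
# once base_url points at it. The API key is read from the environment.
client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=os.environ["OPENROUTER_API_KEY"],
)

# Vision language model used to describe each captured frame.
VLM_MODEL = "qwen/qwen3-vl-8b-instruct"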

VOICE_MAP = {
  "Female (US)": audio.Voice.EN_US.FEMALE,
  "Male (US)": audio.Voice.EN_US.MALE,
  "Female (UK)": audio.Voice.EN_GB.FEMALE,
  "Male (UK)": audio.Voice.EN_GB.MALE,
  "Female (AU)": audio.Voice.EN_AU.FEMALE,
}

def describe_scene(image_url):
  """Send one image to the VLM and return a short description of it."""
  response = client.chat.completions.create(
    model=VLM_MODEL,
    messages=[{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": "You are an assistive AI for someone who cannot see. Describe what is in front of the camera in 1-2 short sentences. Be direct, specific, and clear."},
      ]
    }],
    stream=False  # wait for the full reply instead of streaming tokens
  )
  # content can be None if the model returns no text, so fall back to "".
  return (response.choices[0].message.content or "").strip()
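
To try `describe_scene` without a camera, an image can be passed as a data URL in the `image_url` field. The sketch below is a minimal, hypothetical helper (`to_data_url` is not part of this notebook) that encodes raw image bytes with base64. It assumes the model endpoint accepts base64 data URLs, as OpenAI-compatible image inputs generally do.

```python
import base64

def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
  """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
  encoded = base64.b64encode(image_bytes).decode("ascii")
  return f"data:{mime_type};base64,{encoded}"

# Three placeholder bytes (the start of a JPEG header) stand in for a real file.
sample = to_data_url(b"\xff\xd8\xff")
print(sample)  # data:image/jpeg;base64,/9j/
```

With a real file, something like `describe_scene(to_data_url(open("photo.jpg", "rb").read()))` would send it to the model; `photo.jpg` is a placeholder name.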

Run the Seeing Assistant

Adjust the settings below, then run the cell. Point your camera at different objects or areas around you — the assistant will capture what it sees and read a description aloud.

import cv
import graphics
import time

INTERVAL = 5 #@param {type:"slider", min:3, max:10, step:1}
CAPTURES = 3 #@param {type:"slider", min:1, max:6, step:1}
VOICE = "Female (US)" #@param ["Female (US)", "Male (US)", "Female (UK)", "Male (UK)", "Female (AU)"]

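# Start the camera on a canvas, then pause briefly before the first capture.
canvas = graphics.canvas()
camera = cv.start_camera(canvas)
time.sleep(2)

# Look up the voice selected above.
voice = VOICE_MAP[VOICE]

try:
  for i in range(CAPTURES):
    # Wait between captures, but not before the first one.
    if i > 0:
      time.sleep(INTERVAL)
    data_url = cv.capture_frame(camera)
    description = describe_scene(data_url)
    print(f"[{i + 1}/{CAPTURES}] {description}")
    audio.speak(description, voice=voice)
finally:
  # Always stop the camera, even if a capture or request fails.
  camera.stop()

print("Done.")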
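
The capture loop above can also be rehearsed without a camera, an API key, or audio output. The sketch below is one possible way to do that, not how the notebook itself is structured: `run_assistant` is a hypothetical function that takes the capture, describe, and speak steps as arguments so that stand-ins can be swapped in for a dry run.

```python
import time

def run_assistant(capture, describe, speak, captures=3, interval=5, sleep=time.sleep):
  """Capture, describe, and speak `captures` times, waiting `interval` seconds between rounds."""
  descriptions = []
  for i in range(captures):
    if i > 0:
      sleep(interval)  # no wait before the first capture
    description = describe(capture())
    print(f"[{i + 1}/{captures}] {description}")
    speak(description)
    descriptions.append(description)
  return descriptions

# Dry run with stand-ins: no camera, no API call, no audio, no real waiting.
spoken = []
result = run_assistant(
  capture=lambda: "data:image/jpeg;base64,/9j/",
  describe=lambda url: "A red mug on a wooden desk.",
  speak=spoken.append,
  captures=2,
  sleep=lambda seconds: None,
)
```

In the real notebook, the three stand-ins would correspond to `cv.capture_frame(camera)`, `describe_scene`, and `audio.speak` with the chosen voice.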

Think about what it would be like to rely on this kind of assistant every day.

Did the descriptions give you enough information to understand your surroundings? What would make them more useful? How might you change the prompt to better help someone who is visually impaired?
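
One way to explore the last question is to assemble the prompt from parts so that variations are easy to compare. The sketch below is only an illustration: `build_prompt`, `BASE_PROMPT`, and the `FOCUS_HINTS` options are hypothetical names that are not part of this notebook, and the wording of each hint is a guess at what might help rather than a tested recommendation.

```python
# The prompt used in describe_scene, kept as the starting point.
BASE_PROMPT = (
  "You are an assistive AI for someone who cannot see. "
  "Describe what is in front of the camera in 1-2 short sentences. "
  "Be direct, specific, and clear."
)

# Hypothetical extra instructions, keyed by what the user cares about most.
FOCUS_HINTS = {
  "obstacles": "Mention obstacles or hazards first, with their rough position (left, right, ahead).",
  "text": "Read out any visible signs, labels, or screens.",
  "people": "Say how many people are visible and what they are doing.",
}

def build_prompt(focus=None):
  """Return the base prompt, optionally followed by one focus hint."""
  if focus is None:
    return BASE_PROMPT
  if focus not in FOCUS_HINTS:
    raise ValueError(f"unknown focus: {focus!r}")
  return f"{BASE_PROMPT} {FOCUS_HINTS[focus]}"

print(build_prompt("obstacles"))
```

To try a variant, the text passed to the model in `describe_scene` could be swapped for `build_prompt("text")`, and the spoken descriptions compared with the originals.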

{ "question_type": "multiple_choice", "question": "What is the main role of the VLM in the assistive tool built in this notebook?", "options": [ { "key": "a", "text": "To control the camera zoom" }, { "key": "b", "text": "To describe what the camera sees in words" }, { "key": "c", "text": "To translate speech into different languages" }, { "key": "d", "text": "To store images in the cloud" } ], "answer": "b", "submitted_answer": "" }

{ "question_type": "true_false", "question": "In this notebook, the VLM streams tokens one at a time to produce the spoken description.", "answer": "False", "submitted_answer": "" }

{ "question_type": "freeform", "question": "What Python function is called to speak the VLM's description aloud?", "answer": "audio.speak", "submitted_answer": "" }