Day 9: Generative AI (Images)

Recap

“A photograph of an astronaut riding a horse.”

Training
- During training, random noise is added to images in steps (forward diffusion process)
- Model learns to predict what noise was added
Generation (process runs in reverse)
- Start with pure random noise
- Model estimates what noise should be removed to create a realistic image
- Using the text prompt, the model steers the process towards images that match the description
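
To make the training and generation bullets above concrete, here is a minimal sketch of the noising step on a tiny list of "pixels". It assumes the commonly used DDPM-style formulation with a linear noise schedule, which the slides do not specify. The function names (`make_alpha_bars`, `add_noise`, `recover_image`) and the schedule values are illustrative choices, not taken from the notebook.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction per step for a linear noise schedule (assumed values)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta  # less of the original image survives at each step
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, t, alpha_bars, rng):
    """Forward diffusion: blend the image with Gaussian noise at step t."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(a) * p + math.sqrt(1.0 - a) * n for p, n in zip(pixels, noise)]
    return noisy, noise  # the noise is the training target the model learns to predict


def recover_image(noisy, predicted_noise, t, alpha_bars):
    """Invert add_noise using a noise estimate (a trained model would supply it)."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * n) / math.sqrt(a)
            for x, n in zip(noisy, predicted_noise)]


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
image = [0.1, -0.5, 0.9]  # stand-in for real pixel values
noisy, noise = add_noise(image, 500, alpha_bars, rng)
restored = recover_image(noisy, noise, 500, alpha_bars)  # uses the true noise here
```

In this sketch the true noise is passed to `recover_image`, so the original pixels come back exactly. A real model only estimates the noise, which is why generation removes it gradually over many steps.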

Image Generation Notebook

How was the quality of the images? Was it better or worse than you thought?

Similar process to text-to-image generation, but the model starts from an existing image in addition to a text prompt
Often used to transform or restyle existing images
Model requires an input image and a guiding text prompt
- Example: “Turn this image into an anime drawing”

Image-to-Image Notebook

VLM Notebook

Vision-language models (VLMs) can describe images, and you can also ask them questions about an image
- “Read the text in the image”
- “How many people in this image have red shirts?”
- “Is the person in this image wearing glasses?”
Answering questions like these is often called visual question answering; it relies on the model reasoning about what it sees, which can be very powerful

Reasoning with VLMs

Millions of people around the world live with visual impairments.
Assistive technology — screen readers, braille displays, and now AI — helps them navigate daily life independently.
Combined with audio output, VLMs can describe scenes and other visual details aloud.

A VLM-powered seeing assistant has three steps:
- Capture: The camera takes a photo of the surroundings
- Describe: The VLM generates a short, clear description
- Speak: Text-to-speech reads the description aloud
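
The three steps above can be sketched as a small pipeline. This is only an outline under stated assumptions: `run_seeing_assistant`, the default prompt, and the fallback message are hypothetical, and the camera, VLM, and text-to-speech parts are passed in as plain functions so that any real library could be plugged in. The `fake_*` functions are stand-ins for demonstration, not real devices or models.

```python
from typing import Callable, List

# Hypothetical prompt asking the VLM for a short, clear description.
DEFAULT_PROMPT = "Describe this scene in one or two short, clear sentences."


def run_seeing_assistant(
    capture: Callable[[], bytes],
    describe: Callable[[bytes, str], str],
    speak: Callable[[str], None],
    prompt: str = DEFAULT_PROMPT,
) -> str:
    """Run one capture -> describe -> speak cycle and return the spoken text."""
    image = capture()                              # Capture: get a photo as bytes
    description = describe(image, prompt).strip()  # Describe: ask the VLM
    if not description:
        # Illustrative fallback so the user always hears something.
        description = "Sorry, I could not describe the scene."
    speak(description)                             # Speak: hand text to text-to-speech
    return description


# Stand-ins for a real camera, VLM, and text-to-speech engine.
def fake_capture() -> bytes:
    return b"fake-image-bytes"


def fake_describe(image: bytes, prompt: str) -> str:
    return "A person is standing at a crosswalk."


spoken: List[str] = []  # collects what would be read aloud
result = run_seeing_assistant(fake_capture, fake_describe, spoken.append)
```

Keeping the three steps as separate functions means each one can be swapped or tested on its own, for example replacing `fake_describe` with a call to whichever VLM the notebook uses.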

VLMs as Assistive Technology
