Exploring Generative AI Models: Part 2

Simon Guest

Recap of Last Week’s Lecture

  • Brief History of the Transformer
  • Model evolution, explored text-to-text models
  • API access, quantization, running locally

This Week

  • Dive into image-based models!
  • Explore the Diffuser
  • Editing images using Inpainting and ControlNet
  • Vision Transfomers and VLMs

Image Models

Images Models: Text-to-Image

“A photograph of an astronaut riding a horse.”

  • Based on a concept called a diffusion transformer
  • Commonly known as a diffuser
  • Two stage process, inspired by thermodynamics

Image Models: Diffuser

  • Training
    • During training, random noise is added to images in steps
    • Model learns to predict what noise was added (forward diffusion process)
  • Inference (process runs in reverse)
    • Start with pure random noise
    • Model estimates what noise should be removed to create a realistic image
    • Using the text prompt, the model steers the process towards images that match the description

Image Diffusion Models in 2022

timeline
  August 2022 : Stable Diffusion v1.4
              : First open-source high-quality model
  September 2022 : Stable Diffusion v1.5
                  : Refined version
  October 2022 : eDiff-I (NVIDIA)
                : Ensemble approach
  November 2022 : Stable Diffusion v2.0/2.1
                : Higher resolution (768x768)

Image Diffusion Models in 2023

timeline
  March 2023 : Midjourney v5
              : Exceptional artistic quality
  April 2023 : ControlNet
        : Precise spatial control
        : AnimateDiff - Video generation
  July 2023 : SDXL (Stable Diffusion XL)
            : 1024x1024 native resolution
  August 2023 : SDXL Turbo
              : Real-time capable generation

Image Diffusion Models in 2024

timeline
  February 2024 : Stable Diffusion 3
                : Improved text understanding
  June 2024 : Stable Diffusion 3.5
            : Multiple model sizes
  2024 : FLUX.1 (Black Forest Labs)
        : State-of-the-art open model
        : Imagen 3 (Google DeepMind)
        : Photorealistic quality

Demo: Text to Image Diffusion Process

Diffusion Text-to-Image.ipynb

Images Models: Image-to-Image

Image-to-Image: “Make this image different”

  • Originally solved by GAN approaches, but evolved into extension of the diffuser concept
    • Add noise to the original image (partial denoising)
    • Regenerate it with modifications based on the prompt
    • The original image heavily influences the output structure

Demo: Image to Image Process

Diffusion Image-to-Image.ipynb

Image-to-Image can also be used for…

  • Super resolution
  • Style transfer
  • Colorization (grayscale to color)
  • Depth maps

Depth Maps

Images (often greyscale) where each pixel’s value represents the distance

Depth Maps

Images (often greyscale) where each pixel’s value represents the distance

Depth Maps

Hardware vs. Models

  • Historically used custom hardware
    • Depth Camera (e.g., RealSense) - ~$300-500
    • Module/processing for realtime (60fps) sensing

Depth Maps

Hardware vs. Models

  • Image-to-Image depth estimation models
    • Depth Anything, MiDaS, ZoeDepth
    • Low latency (MiDaS 3.1 @ 20fps on embedded GPU)
  • Used for
    • 3D effects/estimation
    • Simple/low-cost robotics
    • Control input for other images

Depth Maps

Depth Map to Image (using black-forest-labs/flux-depth-pro)

Depth Maps

Depth Map to Image (using black-forest-labs/flux-depth-pro)

Depth Maps

Depth Map to Image (using black-forest-labs/flux-depth-pro)

Depth Maps

Depth Map to Image (using black-forest-labs/flux-depth-pro)

I Want More Control!

I Want More Control!

  • Text-to-Image and Image-to-Image don’t give you that much control
    • Extensive prompts (both positive and negative) can help, but only so far
    • Fine-tuning, but expensive and risk degrading quality
  • Two methods for adding more control
    • Inpainting (and Outpainting)
    • ControlNet

Inpainting

Filling in missing or masked regions of an image in a realistic way

  • Model is given an image with certain areas masked out
  • Generates plausible content to fill those areas based on the surrounding context (and steered by a text prompt)

Inpainting

Filling in missing or masked regions of an image in a realistic way

  • Model needs to understand context around the area
  • Generate content that matches the style, lighting, and perspective
  • Follow a text prompt that was likely different from the original

Outpainting

Inpainting applied to the edges of an image to extend it beyond it’s original size

Source: https://replicate.com/black-forest-labs/flux-fill-pro

Outpainting

Inpainting applied to the edges of an image to extend it beyond it’s original size

Source: https://replicate.com/black-forest-labs/flux-fill-pro

Outpainting

  • How does it work?
    • Supply a prompt: “2x zoom out this image”
    • Treat the new empty regions around the image as masked areas
    • Use impainting technique to fill in the regions

Outpainting

  • More challenging
    • Less context at the edges of the image vs. center/surrounded
    • Need to maintain the style, lighting, and perspective
    • Has to be creative. Can’t just be a repetitive pattern.

Inpainting

Popular models

  • Stable Diffusion (Many inpainting variants)
  • Flux Fill from Black Forest Labs
  • LaMa: Large Mask inpainting
  • Ideogram

Introducing Replicate

Source: https://replicate.com

Demo: Replicate

https://replicate.com

Demo: Calling Replicate API from Colab

Replicate Client.ipynb

ControlNet

Text-to-image is “hoping the model guesses what I mean”

  • Introducing ControlNet
    • Developed by Lvmin Zhang and Maneesh Agrawala at Stanford University
    • Published in February 2023 (Zhang et al. 2023)
    • ControlNet represented a paradigm shift from “describe what you want” to “show the structure you want”.

How ControlNet Works

  • Stable Diffusion’s U-Net has an encoder and decoder
  • Create a trainable copy of the encoder blocks
  • Train the copy of the encoder alongside the frozen SD model
    • During training: use paired data (e.g., pose skeleton → original image)
    • During inference: both encoders run together
    • Features from both are combined via zero convolutions
  • Key: The weights in the original SD model don’t change
  • ControlNet is analogous to a “Plug in” model

Demo: ControlNet Human Pose

ControlNet Human Pose.ipynb

Demo: Control via FLUX Kontext

ControlNet Human Pose (FLUX Kontext).ipynb

ControlNet

Examples of conditioning types:

  • Human pose (OpenPose) - skeleton/keypoint detection
  • Depth maps - 3D structure information
  • Canny edges - line drawings and edge detection
  • Scribbles - rough user drawings
  • QR codes - blended into images

Demo: Putting this all together

Using text-to-image, ControlNet, inpainting, and image-to-image (depth map) to create PBR materials

PBR Material Creator.ipynb

Using Transformers for Computer Vision

CNNs to Vision Transformer

Historically, computer vision has used classification models called CNNs (Convolutional Neural Networks)

  • Enter the Vision Transformer (ViT)
  • Tipping Point
    • Initially, ViTs didn’t outperform CNNs
    • But exceeded SOTA CNNs on larger datasets (such as Google’s JFT-300M)

Vision Language Models (VLMs)

ViTs by themselves are only so useful

  • VLMs
    • A vision encoder
    • Adapter/projector layer
    • Language model (LLaMa or GPT)
  • Also known as “Multimodal”
    • Image-Text-to-Text

Demo: Gemma-3 VLM

LMStudio / Gemma-3-27b-it GGUF for vision tasks

Use Cases for VLMs

  • Visual Understanding: “What’s in this image?”
  • Accessibility: Assisting users with visual impairments
  • Content Moderation and Safety: Identifying harmful content
  • Retail: Finding products with photos
  • Education: Helping students understand charts, diagrams, equations
  • Robotics: Providing Robots with information to navigate their environment

Demo: Apple’s FastVLM

Apple FastVLM running in WebGPU

Resources

Resources

Q&A

Topic Survey

Bibliography

Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, et al. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv Preprint arXiv:2010.11929.
Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. 2023. “Adding Conditional Control to Text-to-Image Diffusion Models.” arXiv Preprint arXiv:2302.05543.