Module 4 Assignment: Multimedia & Multimodal Applications
Objective: Build a working application that demonstrates your understanding of multimedia or multimodal AI models.
Choose Your Adventure: Pick ONE of the three options below.
Option 1: ControlNet Scribble App
Build a Gradio application that transforms hand-drawn sketches into realistic images using ControlNet.
Requirements:
- Create a Gradio interface with a sketchpad/canvas input
- Use ControlNet (scribble or canny edge model) to condition image generation
- Allow users to enter a text prompt to guide the style/content
- Generate and display the resulting image
- Include at least one configurable parameter (e.g., guidance scale, number of steps)
Suggested approach:
- Use Replicate’s ControlNet models for easier deployment, OR
- Run locally using HuggingFace diffusers with a ControlNet pipeline
Option 2: Replicate Model Pipeline
Create a pipeline that chains multiple Replicate models together to transform images through a multi-step process.
Requirements:
- Chain at least 3 different models in sequence (e.g., depth estimation → ControlNet → upscaling)
- Create a Gradio interface that accepts an input image and displays intermediate/final results
- Document what each model in your pipeline does and why you chose it
- Show the transformation at each stage (not just the final output)
Example pipeline ideas:
- Photo → Depth Map → Stylized Scene → Upscaled Output
- Portrait → Pose Extraction → New Character in Same Pose → Background Replacement
- Sketch → Colorized Image → Style Transfer → Final Composition
Option 3: Vision Language Model Application
Implement a practical application using a Vision Language Model (VLM) for a real-world use case.
Requirements:
- Use an open-source VLM (e.g., Gemma 3, LLaVA, FastVLM) - not a closed API like GPT-4V or Claude
- Build a Gradio interface that accepts image input
- Implement a specific, practical use case such as:
- Accessibility: Describe images for visually impaired users
- Product Detection: Identify and catalog items from photos
- Document Analysis: Extract information from receipts, forms, or charts
- Educational: Explain diagrams, equations, or scientific figures
- Include thoughtful prompt engineering in your system prompt
Deliverable: A Colab/Jupyter notebook with:
- Code cells with your implementation
- A working Gradio interface that can be launched and tested
- Uses environment variables for any API keys (e.g.,
REPLICATE_API_TOKEN). Please do not include your API key in your notebook! - Markdown cells explaining:
- Which option you chose and why
- Your design decisions and approach
- Observations about model behavior, quality, or limitations
- What worked well and what was challenging
Hints
- Option 1: The
gr.Sketchpadorgr.ImageEditorcomponents in Gradio work well for drawing input. Start with a simple black-and-white sketch before adding complexity. - Option 2: Consider what each model needs as input and produces as output. The PBR notebook (
pbr-creator.ipynb) demonstrates this chaining pattern. - Option 3: Smaller models (like FastVLM-0.5B or Gemma 3 4B) can run on Colab’s T4 GPU. Focus on crafting a good system prompt that guides the model toward your specific use case.
- All options: Test with multiple different inputs to understand the model’s capabilities and limitations.