Exploring Generative AI Models: Part 1

Simon Guest

Recap of Last Week’s Lecture

Introduced AI Agents, their uses, how to create them
About 50% had used an API to call AI Model
One or two beyond this

This Week

Two Part Lecture on AI Models
- This Week
  - Explore text-to-text models
  - Model evolution, API access, running locally
  - Five Demos!
- Next Week
  - Explore image models
  - Diffuser, ControlNet, and VLMs
  - More Demos!

A Brief History of Transformer Models

timeline
    June 2017 : Google researchers publish "Attention is all you need" paper [1]
              : Introduces self-attention mechanism and transformer architecture
              : Eliminates the need for recurrent neural networks in sequence processing

    June 2018 : OpenAI releases GPT-1
              : 117M parameters
              : Demonstrates pre-training on large text corpora followed by fine-tuning works effectively

    Feb 2019 : OpenAI releases GPT-2
             : 1.5B parameters
             : Initially withheld full model due to concerns about misuse
             : Demonstrates impressive text generation capabilities with minimal fine-tuning

    May 2020 : OpenAI releases GPT-3
             : 175B parameters
             : Demonstrates strong few-shot learning capabilities
             : Marks a significant leap in model capabilities and scale

    June 2020 : GPT-3 available through OpenAI API
              : Still a completion model, not instruction-tuned

[1] Vaswani et al. (2017)

Completion vs. Instruction-Tuned

Completion Model just predicts the next token
- Input prompt: Mary had a little
- Max total tokens: 50
- Temperature: 0 - 1.0
- top_k: consider only the top k tokens in the response
- top_p: Nucleus sampling (probability cut off - 0 and 1.0)
Output
- Mary had a little lamb, its fleece was white as snow... (up to max tokens)

Completion vs. Instruction-Tuned

You can’t really converse with it
What is the capital of France? (max tokens = 50)
What is the capital of France? Paris. What is the capital of Spain? Madrid. What is the capital of
But it’s the foundation of today’s text models, and fun to play with…

Introducing Google Colab

Source: https://colab.research.google.com/signup

Demo: GPT-2

GPT-2.ipynb

Instruction-Tuned Models

Supervised Fine-Tuning
- Large datasets of questions/answers, tasks/completions, demonstrating helpful assistant behavior
RLHF (Reinforcement Learning from Human Feedback)
- Human raters rank different model responses, training a reward model
Chat Templates
- Structured formats to distinguish speakers in a dialog: Typically system, user, and assistant

A Brief History of Transformer Models

timeline
    2021 : InstructGPT Development
          : Built on GPT-3 with RLHF fine-tuning
          : Trained to follow instructions and understand user intent
          : Key innovation enabling ChatGPT
    
    Jan 2021 : Anthropic Founded
             : Founded by Dario & Daniela Amodei with ~7 senior OpenAI employees
            : Dario led GPT-2/3 development and co-invented RLHF

    Nov 2022 : ChatGPT Launch
                  : Built on GPT-3.5 using RLHF
                  : 1M+ users in 5 days
                  : Sparked widespread interest in generative AI

    Feb 2023 : Llama 1 Released
                  : Meta's LLaMA (7B, 13B, 33B, 65B parameters)
                  : 13B model exceeded GPT-3 (175B) on most benchmarks
                  : Limited researcher access
                  : Text completion only (Alpaca fine-tune added instructions)

    Jul 2023 : Llama 2 Released
              : Available in 7B, 13B, 70B sizes
              : Trained on 40% more data than Llama 1
              : First open-weights Llama for commercial use

Closed vs. Open Models

Closed Source: Hosted models; no ability to inspect the weights of the models. Accessed via an API (or UI).
- Examples: OpenAI GPT-5, Claude Sonnet 4.5
Open Weight: Model files with pretrained weights, but no training data. Host on your own hardware.
- Examples: Meta’s Llama (and derivatives), Gemma
Open Source: Models with access to the training data set. Create from scratch.
- Examples: OLMo from AI2

Calling Models via APIs

HTTP-based APIs
- Client makes HTTP API calls to invoke/access the model
- (Normally use an SDK to wrap the HTTP API calls)
- Client passes Authorization token as part of the call
- Default way of accessing OpenAI, Claude, other large, closed-source models

Demo: OpenAI SDK/API call

OpenAI.ipynb

OpenAI’s Chat Completions API

Debuted in March 2023, together with the ChatGPT API
Structure
- Messages array (system, assistant, user)
- Streaming support (using SSE - Server Side Events)
- Function calling (added mid-2023)
- Structured output (added Aug 2024)
Widespread Adoption
- Anthropic, Azure, TogetherAI
- Local hosting: vLLM, LM Studio

OpenAI’s Chat Completions API

How do we consume different models from multiple providers?

Introducing OpenRouter (https://openrouter.ai)
- A unified API to hundreds of AI models through a single endpoint
- (Using OpenAI’s Chat Completion API)
- OpenAI, Claude, Gemini, Grok, Nova, Llama, DeepSeek, Qwen, and many others.
- Pay per API call, often same cost as the provider

Demo: OpenRouter

https://openrouter.ai

OpenRouter.ipynb

Downloading and Running Models

So far, we’ve called hosted models via APIs
How about downloading and running models on your own hardware?
- (Obviously they need to be open-weight models)

Downloading and Running Models

Why would you want to do this?

Offline access to models (no Internet required)
Potential cost savings (if many API calls and already own hardware)
- e.g., running a small model embedded within a game
Want to fine-tune your own model and have the hardware to do it
Don’t want others to see the conversations you are having :)

Introducing Hugging Face

Source: https://huggingface.co

What is Hugging Face?

It is to AI models what GitHub is to source code

Explore, download models to run on local hardware
Upload and share your own trained/fine-tuned models and datasets
Create “Spaces” - web-based apps for accessing models

Demo: Exploring a Model

Google’s gemma-3-1b-it on Hugging Face

Hugging Face Transformers

Source: https://huggingface.co/docs/transformers

Hugging Face Transformers

What is the Hugging Face Transformers Library?

Open-source Python library to provide easy access to using various types of pre-trained transformer models
Brings together all of the different formats under one interface
- Different models, vendors, types, chat templates
- Different implementations: PyTorch, TensorFlow, JAX
A few lines of code to download and run the model

Demo: Using HF Transformers to download and use Gemma 3 1B

gemma-3-1b-it via transformers.ipynb

“Out of VRAM”

One challenge of running models on your own hardware is VRAM

Roughly speaking, the size of the model will determine how much VRAM you need
Gemma 3 models
- gemma-3-1b-it = 2Gb
- gemma-3-4b-it = 8.6Gb
- gemma-3-12b-it = 23.37Gb
- Qwen3-VL-235B-A22B-Thinking = ~475Gb

“Out of VRAM”

Google Colab Tiers
- Colab Free T4 = 16Gb VRAM (15Gb usable)
- Colab Pro V100 = 16Gb VRAM
- Colab Pro A100 = 40Gb VRAM
Your Gaming PC
- Probably 8Gb VRAM
Your Phone
- V-what? :)

“Out of VRAM”

You can select smaller models, but they are less accurate / more prone to hallucination.

How do we fix this?
- Quantization

Quantization

Process of reducing the precision of a model’s weights and activations. For example, 16-bit numbers to 4-bit.

Parameter count matters more than precision
- A 70B parameter model at 4-bit often beats a 13B model at b16
- The models knowledge remains largely intact
- Often the extra precision doesn’t meaningfully improve outputs

Quantization Formats

The llama.cpp project (implementing LLMs in pure C/C++) has driven advancements in quantization

GGUF (GPT-Generated Unified Format)
- Single file architecture
- Model format supporting multiple quantization levels (2-bit through 8-bit) with CPU and GPU handoff
MLX (Apple’s ML framework and format for Apple Silicon)
- Debuted in late 2023
- Supports 4 and 8 bit quantization schemes

Running Quantized Models

Tools built upon llama.cpp
- Ollama, LM Studio, koboldcpp

Demo: Running Gemma 3 27B GGUF on my laptop

LMStudio: gemma-3-27b-it-qat-q4_0-gguf

Demo: C# Client <-> Gemma 3 27B Local

demos/01/lmstudio-client/LMStudioClient.csproj

Hosting Models in Unity

Download the GGUF model locally to Assets/StreamingAssets folder
Use llama.cpp bindings for C# to host
- LLAMASharp: https://github.com/SciSharp/LLamaSharp
Use OpenAI SDK (or similar) as client
Unity Demo
- https://github.com/eublefar/LLAMASharpUnityDemo

Resources

This slide deck, resources, links, notebooks, everything:
- https://simonguest.github.io/CSP
- (I’ll also post to the GAM-400 and CSP-300/400 channels)

Q&A

Bibliography

Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.