Exploring Generative AI Models: Part 1

Simon Guest

Recap of Last Week’s Lecture

  • Introduced AI Agents, their uses, how to create them
  • About 50% had used an API to call AI Model
  • One or two beyond this

This Week

  • Two Part Lecture on AI Models
    • This Week
      • Explore text-to-text models
      • Model evolution, API access, running locally
      • Five Demos!
    • Next Week
      • Explore image models
      • Diffuser, ControlNet, and VLMs
      • More Demos!

A Brief History of Transformer Models

A Brief History of Transformer Models

timeline
    June 2017 : Google researchers publish "Attention is all you need" paper [1]
              : Introduces self-attention mechanism and transformer architecture
              : Eliminates the need for recurrent neural networks in sequence processing

    June 2018 : OpenAI releases GPT-1
              : 117M parameters
              : Demonstrates pre-training on large text corpora followed by fine-tuning works effectively

    Feb 2019 : OpenAI releases GPT-2
             : 1.5B parameters
             : Initially withheld full model due to concerns about misuse
             : Demonstrates impressive text generation capabilities with minimal fine-tuning

    May 2020 : OpenAI releases GPT-3
             : 175B parameters
             : Demonstrates strong few-shot learning capabilities
             : Marks a significant leap in model capabilities and scale

    June 2020 : GPT-3 available through OpenAI API
              : Still a completion model, not instruction-tuned

[1] Vaswani et al. (2017)

Completion vs. Instruction-Tuned

  • Completion Model just predicts the next token
    • Input prompt: Mary had a little
    • Max total tokens: 50
    • Temperature: 0 - 1.0
    • top_k: consider only the top k tokens in the response
    • top_p: Nucleus sampling (probability cut off - 0 and 1.0)
  • Output
    • Mary had a little lamb, its fleece was white as snow... (up to max tokens)

Completion vs. Instruction-Tuned

  • You can’t really converse with it
  • What is the capital of France? (max tokens = 50)
  • What is the capital of France? Paris. What is the capital of Spain? Madrid. What is the capital of
  • But it’s the foundation of today’s text models, and fun to play with…

Introducing Google Colab

Source: https://colab.research.google.com/signup

Demo: GPT-2

GPT-2.ipynb

Instruction-Tuned Models

  • Supervised Fine-Tuning
    • Large datasets of questions/answers, tasks/completions, demonstrating helpful assistant behavior
  • RLHF (Reinforcement Learning from Human Feedback)
    • Human raters rank different model responses, training a reward model
  • Chat Templates
    • Structured formats to distinguish speakers in a dialog: Typically system, user, and assistant

A Brief History of Transformer Models

timeline
    2021 : InstructGPT Development
          : Built on GPT-3 with RLHF fine-tuning
          : Trained to follow instructions and understand user intent
          : Key innovation enabling ChatGPT
    
    Jan 2021 : Anthropic Founded
             : Founded by Dario & Daniela Amodei with ~7 senior OpenAI employees
            : Dario led GPT-2/3 development and co-invented RLHF

    Nov 2022 : ChatGPT Launch
                  : Built on GPT-3.5 using RLHF
                  : 1M+ users in 5 days
                  : Sparked widespread interest in generative AI

    Feb 2023 : Llama 1 Released
                  : Meta's LLaMA (7B, 13B, 33B, 65B parameters)
                  : 13B model exceeded GPT-3 (175B) on most benchmarks
                  : Limited researcher access
                  : Text completion only (Alpaca fine-tune added instructions)

    Jul 2023 : Llama 2 Released
              : Available in 7B, 13B, 70B sizes
              : Trained on 40% more data than Llama 1
              : First open-weights Llama for commercial use

Closed vs. Open Models

  • Closed Source: Hosted models; no ability to inspect the weights of the models. Accessed via an API (or UI).
    • Examples: OpenAI GPT-5, Claude Sonnet 4.5
  • Open Weight: Model files with pretrained weights, but no training data. Host on your own hardware.
    • Examples: Meta’s Llama (and derivatives), Gemma
  • Open Source: Models with access to the training data set. Create from scratch.
    • Examples: OLMo from AI2

Calling Models via APIs

Calling Models via APIs

  • HTTP-based APIs
    • Client makes HTTP API calls to invoke/access the model
    • (Normally use an SDK to wrap the HTTP API calls)
    • Client passes Authorization token as part of the call
    • Default way of accessing OpenAI, Claude, other large, closed-source models

Demo: OpenAI SDK/API call

OpenAI.ipynb

OpenAI’s Chat Completions API

  • Debuted in March 2023, together with the ChatGPT API
  • Structure
    • Messages array (system, assistant, user)
    • Streaming support (using SSE - Server Side Events)
    • Function calling (added mid-2023)
    • Structured output (added Aug 2024)
  • Widespread Adoption
    • Anthropic, Azure, TogetherAI
    • Local hosting: vLLM, LM Studio

OpenAI’s Chat Completions API

How do we consume different models from multiple providers?

  • Introducing OpenRouter (https://openrouter.ai)
    • A unified API to hundreds of AI models through a single endpoint
    • (Using OpenAI’s Chat Completion API)
    • OpenAI, Claude, Gemini, Grok, Nova, Llama, DeepSeek, Qwen, and many others.
    • Pay per API call, often same cost as the provider

Demo: OpenRouter

https://openrouter.ai

OpenRouter.ipynb

Downloading and Running Models

Downloading and Running Models

  • So far, we’ve called hosted models via APIs
  • How about downloading and running models on your own hardware?
    • (Obviously they need to be open-weight models)

Downloading and Running Models

Why would you want to do this?

  • Offline access to models (no Internet required)
  • Potential cost savings (if many API calls and already own hardware)
    • e.g., running a small model embedded within a game
  • Want to fine-tune your own model and have the hardware to do it
  • Don’t want others to see the conversations you are having :)

Introducing Hugging Face

Source: https://huggingface.co

What is Hugging Face?

It is to AI models what GitHub is to source code

  • Explore, download models to run on local hardware
  • Upload and share your own trained/fine-tuned models and datasets
  • Create “Spaces” - web-based apps for accessing models

Demo: Exploring a Model

Google’s gemma-3-1b-it on Hugging Face

Hugging Face Transformers

Source: https://huggingface.co/docs/transformers

Hugging Face Transformers

What is the Hugging Face Transformers Library?

  • Open-source Python library to provide easy access to using various types of pre-trained transformer models
  • Brings together all of the different formats under one interface
    • Different models, vendors, types, chat templates
    • Different implementations: PyTorch, TensorFlow, JAX
  • A few lines of code to download and run the model

Demo: Using HF Transformers to download and use Gemma 3 1B

gemma-3-1b-it via transformers.ipynb

“Out of VRAM”

One challenge of running models on your own hardware is VRAM

  • Roughly speaking, the size of the model will determine how much VRAM you need
  • Gemma 3 models
    • gemma-3-1b-it = 2Gb
    • gemma-3-4b-it = 8.6Gb
    • gemma-3-12b-it = 23.37Gb
    • Qwen3-VL-235B-A22B-Thinking = ~475Gb

“Out of VRAM”

  • Google Colab Tiers
    • Colab Free T4 = 16Gb VRAM (15Gb usable)
    • Colab Pro V100 = 16Gb VRAM
    • Colab Pro A100 = 40Gb VRAM
  • Your Gaming PC
    • Probably 8Gb VRAM
  • Your Phone
    • V-what? :)

“Out of VRAM”

You can select smaller models, but they are less accurate / more prone to hallucination.

  • How do we fix this?
    • Quantization

Quantization

Process of reducing the precision of a model’s weights and activations. For example, 16-bit numbers to 4-bit.

  • Parameter count matters more than precision
    • A 70B parameter model at 4-bit often beats a 13B model at b16
    • The models knowledge remains largely intact
    • Often the extra precision doesn’t meaningfully improve outputs

Quantization Formats

The llama.cpp project (implementing LLMs in pure C/C++) has driven advancements in quantization

  • GGUF (GPT-Generated Unified Format)
    • Single file architecture
    • Model format supporting multiple quantization levels (2-bit through 8-bit) with CPU and GPU handoff
  • MLX (Apple’s ML framework and format for Apple Silicon)
    • Debuted in late 2023
    • Supports 4 and 8 bit quantization schemes

Running Quantized Models

  • Tools built upon llama.cpp
    • Ollama, LM Studio, koboldcpp

Demo: Running Gemma 3 27B GGUF on my laptop

LMStudio: gemma-3-27b-it-qat-q4_0-gguf

Demo: C# Client <-> Gemma 3 27B Local

demos/01/lmstudio-client/LMStudioClient.csproj

Hosting Models in Unity

  • Download the GGUF model locally to Assets/StreamingAssets folder
  • Use llama.cpp bindings for C# to host
    • LLAMASharp: https://github.com/SciSharp/LLamaSharp
  • Use OpenAI SDK (or similar) as client
  • Unity Demo
    • https://github.com/eublefar/LLAMASharpUnityDemo

Resources

Resources

Q&A

Bibliography

Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.