Glossary

Key terms, concepts, and acronyms from CS-394/594: How Generative AI Works.

Architectures and Model Types

ALM (Audio-Language Model)
A multimodal model that processes both audio and text together. (Module 4)
Autoregressive Generation
A text generation approach that produces one token at a time, feeding each output back as input for the next prediction. Because each next token is sampled from a probability distribution, the same prompt can produce different outputs. (Module 1)
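As a toy illustration of the autoregressive loop, the sketch below uses a hand-built next-token table in place of a real model (the `PROBS` table and `generate` function are invented for this example). The key point is the feedback: each sampled token is appended to the context and conditions the next prediction.

```python
import random

# Toy "model": next-token probabilities keyed on the previous token alone.
# A real LLM conditions on the entire sequence so far.
PROBS = {
    "the": [("cat", 0.6), ("dog", 0.4)],
    "cat": [("sat", 0.7), ("ran", 0.3)],
    "dog": [("sat", 0.5), ("ran", 0.5)],
    "sat": [("<eos>", 1.0)],
    "ran": [("<eos>", 1.0)],
}

def generate(prompt, max_tokens=10, seed=None):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        candidates = PROBS[tokens[-1]]                # predict from context
        words, weights = zip(*candidates)
        nxt = rng.choices(words, weights=weights)[0]  # sample one token
        if nxt == "<eos>":
            break
        tokens.append(nxt)                            # feed output back as input
    return tokens

print(generate(["the"], seed=0))
```

Because sampling is random, different seeds give different continuations of the same prompt, which is exactly why identical prompts to an LLM can yield different answers.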
CLIP
OpenAI’s vision encoder trained on 400 million image-text pairs. Creates a shared embedding space between images and text, making it the foundation of most VLMs. (Module 4)
CNN (Convolutional Neural Network)
A classic neural network architecture for image tasks, based on convolution operations. Historically used for image classification before Vision Transformers became dominant. (Module 4)
Decoder-only Architecture
A transformer variant where self-attention is causal/masked — tokens can only attend to previous tokens, not future ones. The basis of GPT-style models. (Module 1)
DINO / DINOv2
Meta’s self-supervised vision transformer (“Self DIstillation with NO Labels”), trained on 142 million images without labels. (Module 4)
Diffusion Model
An image generation architecture inspired by thermodynamics. During training, noise is progressively added to images. During inference, the model starts from random noise and iteratively removes it, guided by a text prompt. (Module 4)
Encoder-Decoder (Seq2Seq)
A transformer variant with both an encoder (which generates contextual representations via self-attention) and a decoder (which generates output tokens one at a time, using cross-attention to the encoder’s output). Used in translation models. (Module 1)
FastVLM / FastViTHD
Apple’s efficient Vision Language Model combining transformers and convolutional layers, optimized for on-device real-time performance. (Module 4)
GPT (Generative Pre-trained Transformer)
A decoder-only transformer architecture pre-trained on next-token prediction. The basis for models like ChatGPT. (Modules 1, 2)
LLaVA
An influential open-source Vision Language Model developed by University of Wisconsin-Madison and Microsoft Research. (Module 4)
LLM (Large Language Model)
A large-scale neural network trained on massive text corpora to understand and generate human language. The primary focus of this course. (Modules 0–8)
MMDiT (Multimodal Diffusion Transformer)
The architecture used by FLUX image generation models. Processes text and image tokens together in a unified transformer, replacing the U-Net architecture used in Stable Diffusion. (Module 4)
MoE (Mixture of Experts)
A neural network architecture with multiple “expert” sub-networks and a routing layer that activates only a subset of experts for each input token. Enables larger effective model size while keeping active compute low. (Module 5)
RNN (Recurrent Neural Network)
An older sequence-modeling architecture that processes tokens one at a time while carrying a hidden state; superseded by the Transformer for NLP tasks. (Module 1)
SLM (Small Language Model)
Smaller language models designed to run on local or consumer-grade hardware. (Modules 0, 5, 6)
Swin Transformer
Microsoft’s vision transformer using a “shifted window” attention strategy; excels at dense prediction tasks like object detection and segmentation. (Module 4)
Transformer
The neural network architecture introduced in the 2017 paper “Attention Is All You Need.” Eliminated the need for RNNs in sequence tasks by using attention mechanisms. The foundation of virtually all modern LLMs. (Modules 1, 2, 4)
U-Net
A neural network architecture with an encoder-decoder structure used in Stable Diffusion for image generation. (Module 4)
ViT (Vision Transformer)
A transformer applied to images by dividing them into 16×16 patches and treating each patch as a token. Introduced in “An Image is Worth 16×16 Words.” (Module 4)
VLM (Vision Language Model)
A multimodal model combining a vision encoder, an adapter/projector layer, and a language model. Enables image-and-text-to-text tasks. (Modules 4, 6)

Training Concepts and Techniques

Alignment
Post-training refinement that shapes a model toward preferred behaviors and values. Includes techniques like RLHF and Constitutional AI. (Modules 7, 8)
Backpropagation
The algorithm for computing gradients through a neural network and updating model weights to reduce loss. (Module 7)
Batch Size
The number of training examples processed together in a single forward/backward pass. (Module 7)
Constitutional AI
Anthropic’s alignment approach that uses a written set of principles to guide model behavior during training. (Module 8)
Data Poisoning / Watermarking (Glaze / Nightshade)
Techniques developed at UChicago that add imperceptible adversarial perturbations to images: Glaze cloaks an artist’s style so models cannot imitate it, while Nightshade poisons training data to degrade models trained on it. (Module 8)
Distillation (Knowledge Distillation)
Training a smaller model using the outputs of a larger, more capable model. Also used maliciously to extract capabilities from commercial models. (Modules 6, 8)
Epoch
One complete pass through the entire training dataset. (Module 7)
Expert Collapse
A failure mode in MoE training where a few experts handle nearly all tokens and the rest go largely unused. (Module 5)
Fine-tuning
Continuing to train a pre-trained model on a smaller, curated dataset to adapt it to a specific task, style, or behavior. (Modules 6, 7)
FrankenMoE / MoErge
A community approach to creating MoE models by combining the FFN layers of multiple specialized models (e.g., math, coding, chat) into a single model with a new router network. (Module 5)
Gradient Accumulation
Accumulating gradients over multiple mini-batches before taking an optimizer step. A memory-efficient way to simulate a larger effective batch size. (Module 7)
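A minimal numeric sketch of why accumulation works, using a one-weight linear model with squared-error loss (the `grad` function and data are invented for illustration): averaging the gradients of equal-sized micro-batches gives exactly the full-batch gradient, so the optimizer step is the same as with the larger batch.

```python
def grad(w, batch):
    # Gradient of mean squared error for the model y_hat = w * x.
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]

# One big batch:
g_full = grad(w, data)

# Two equal micro-batches, gradients accumulated then averaged:
g_acc = (grad(w, data[:2]) + grad(w, data[2:])) / 2

print(g_full, g_acc)  # identical: accumulation simulates the larger batch
```

In a real training loop the framework sums gradients across `loss.backward()` calls and you divide the loss (or gradients) by the number of accumulation steps before `optimizer.step()`.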
Instruction-tuning
Fine-tuning a base model on large datasets of question/answer pairs and task-completion examples to make it follow instructions and behave as a helpful assistant. (Module 2)
Learning Rate
A hyperparameter controlling how large a step the optimizer takes when updating weights. Too high causes instability; too low causes slow convergence. (Module 7)
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that freezes the base model weights and introduces two small trainable matrices (A and B) whose product captures the desired behavioral change. Results in a small, portable “adapter.” (Module 7)
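A toy sketch of the low-rank idea (shapes and values invented for illustration; real adapters use rank 8–64 on matrices with thousands of rows and columns): instead of training a full d×d update to a frozen weight matrix, LoRA trains two thin matrices whose product is the update.

```python
# Frozen 4x4 weight matrix gets a rank-1 update delta_W = A @ B.
r, d = 1, 4
A = [[1.0], [0.0], [0.5], [0.0]]   # d x r (trainable)
B = [[0.2, 0.0, -0.1, 0.4]]        # r x d (trainable)

delta_W = [[sum(A[i][k] * B[k][j] for k in range(r)) for j in range(d)]
           for i in range(d)]

full_params = d * d          # 16 values if we trained W directly
lora_params = d * r + r * d  # 8 trainable values instead
```

The savings look modest at d=4, but at d=4096 and r=8 the adapter trains 65,536 values per matrix instead of 16.8 million, which is why the resulting adapter file is so small and portable.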
Overfitting
When a model memorizes training data rather than learning to generalize. Detected by validation loss increasing while training loss continues to decrease. (Module 7)
Pretraining
The initial large-scale training run that teaches the model language by predicting the next token over a massive text corpus. Produces the “base model.” (Modules 2, 7)
QLoRA (Quantized LoRA)
A variant of LoRA that quantizes the base model weights to 4-bit (NF4 format) to reduce memory usage during fine-tuning, while keeping the adapter matrices at higher precision (bf16). (Module 7)
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human raters rank different model responses, training a reward model that guides further fine-tuning. Used to create InstructGPT and ChatGPT. (Modules 2, 3)
Supervised Fine-Tuning (SFT)
Fine-tuning a pre-trained model on a curated labeled dataset. The standard first step in aligning a base model to follow instructions. (Modules 2, 6, 7)
Synthetic Data / Data Distillation
Generating training examples by prompting a more capable model (e.g., “Generate 100 examples of a student asking a teacher a geography question”). (Module 6)
Upcycling
A MoE training strategy that starts from an existing dense model, replicates its layers into multiple experts, and adds router networks. Faster and more stable than training a MoE from scratch. (Module 5)
Validation Loss
Model performance measured on a held-out dataset not used during training. Used to detect overfitting. (Module 7)

Embeddings, Tokenization, and Attention

Attention Mechanism
The core mechanism in transformers that allows each token to weigh the importance of every other token in the sequence when building its representation. (Module 1)
BPE (Byte Pair Encoding)
A tokenization algorithm that splits words into frequently occurring subword units. Originally a data compression algorithm, adapted for neural machine translation in 2016. (Module 1)
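One BPE merge step can be sketched in a few lines (the tiny corpus and helper names are invented for illustration): count adjacent symbol pairs across the corpus, then fuse every occurrence of the most frequent pair into a single symbol. Training repeats this until the vocabulary reaches the desired size.

```python
from collections import Counter

def most_frequent_pair(words):
    # words: list of symbol sequences; count adjacent symbol pairs.
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)      # ("l", "o") appears in all three words
corpus = merge_pair(corpus, pair)      # "low" is now tokenized as ["lo", "w"]
```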
Causal / Masked Self-Attention
A self-attention variant where tokens can only attend to previous tokens, not future ones. Used in GPT-style decoder-only models. (Module 1)
CBOW (Continuous Bag-of-Words)
A Word2Vec training method that predicts a center word from its surrounding context words. (Module 1)
Contextual Embeddings
Word representations that change based on surrounding context, unlike the static embeddings produced by Word2Vec. Created during the transformer’s training process. (Module 1)
Cosine Similarity
A common measure of similarity between two vectors, used to find related embeddings in vector search. (Modules 1, 6)
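The formula is the dot product of the two vectors divided by the product of their lengths, giving 1.0 for parallel vectors and 0.0 for orthogonal ones. A minimal pure-Python version:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); math.hypot computes the Euclidean norm.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel: 1.0 (up to rounding)
```

Vector search libraries compute the same quantity, usually after normalizing embeddings so it reduces to a plain dot product.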
Cross-Attention
An attention mechanism in the decoder that attends to the encoder’s output representations, allowing the decoder to “look at” the source input. (Module 1)
One-hot Encoding
An early NLP word representation method where each word is a sparse binary vector with a single 1. Replaced by dense embeddings. (Module 1)
Self-Attention
An attention mechanism where each token in a sequence attends to all other tokens in the same sequence. The primary building block of transformers. (Module 1)
Sentence Transformer
A model that creates embeddings for entire sentences rather than individual words. Used for semantic search and RAG retrieval (e.g., all-MiniLM-L6-v2 with a 384-dimensional vector space). (Module 6)
Skip-gram
A Word2Vec training method that predicts surrounding context words from a center word. (Module 1)
Token
The basic unit of input and output for a language model — a subword piece of text. API costs are measured in tokens. (Modules 1, 2)
Tokenization
The process of converting raw text into a sequence of numerical tokens for model input. Different models use different tokenizers. (Modules 1, 2)
Vector Arithmetic
Mathematical operations on word embeddings that capture semantic relationships (e.g., king − man + woman ≈ queen). (Module 1)
Vector Space
The multi-dimensional mathematical space in which embeddings are placed, where similar concepts are geometrically close. (Module 1)
Word Embeddings
Dense numerical representations of words in a multi-dimensional space where semantically similar words are geometrically close. (Module 1)
Word2Vec
A 2013 Google Research technique for learning word embeddings using neural networks; introduced the Skip-gram and CBOW training methods. (Module 1)

Sampling and Generation Parameters

Constrained Decoding
A technique where the next token is dynamically filtered to only allow tokens that keep the output in a valid state (e.g., valid JSON). The mechanism behind Structured Outputs. (Module 2)
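A drastically simplified sketch of the masking step (the tiny vocabulary and the regex "grammar" are invented for illustration; a real system compiles the full JSON schema into a token mask at every decoding step): given the output so far, keep only the candidate tokens that leave the output a valid prefix of the grammar.

```python
import re

# Toy vocabulary and a toy grammar: output must stay a prefix of a
# quoted string containing only "a" and "b", like "ab".
VOCAB = ['"', 'a', 'b', '}', '{', '1']
PATTERN = re.compile(r'"[ab]*"?')

def allowed_tokens(prefix):
    # A token is allowed if the extended output still matches the grammar.
    return [t for t in VOCAB if PATTERN.fullmatch(prefix + t)]

print(allowed_tokens(''))    # ['"']           -- must open the string first
print(allowed_tokens('"a'))  # ['"', 'a', 'b'] -- close the string or continue it
```

At each step the model's probabilities for disallowed tokens are zeroed out before sampling, so the output can never leave the valid state.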
Context Window
The maximum number of tokens a model can process in a single request, including both the input (conversation history, system prompt) and the generated response. (Modules 1, 2)
Negative Prompt
In image generation, text telling the model what to avoid in the output. Common in Stable Diffusion workflows. (Module 4)
Seed
An integer used to initialize image generation from a specific random noise state. Using the same seed with the same prompt reproduces the same output. (Module 4)
Strength Parameter
In image-to-image generation (range 0.0–1.0), controls how much the original image influences the output versus the new prompt. (Module 4)
Temperature
A parameter controlling randomness in token generation (range 0.0–1.0+). Lower values produce more deterministic outputs; higher values produce more creative or varied outputs. (Modules 1, 2)
top_k
A sampling strategy that restricts the next token candidates to the top k tokens by probability. (Modules 1, 2)
top_p (Nucleus Sampling)
A sampling strategy that keeps only the smallest set of top tokens whose cumulative probability exceeds a threshold p, dynamically adjusting the candidate pool. (Modules 1, 2)
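The three parameters above can be sketched as transforms on a probability distribution (the helper names `softmax`, `top_k_filter`, and `top_p_filter` are invented for this example): temperature rescales the logits before the softmax, while top-k and top-p zero out and renormalize the tail.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    # Keep the k largest probabilities (ties at the cutoff are kept), renormalize.
    cutoff = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Keep the smallest set of top tokens whose cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [0.0] * len(probs), 0.0
    for i in order:
        kept[i] = probs[i]
        cum += probs[i]
        if cum >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]

logits = [2.0, 1.0, 0.5, -1.0]
cold = softmax(logits, temperature=0.2)  # sharper: near-deterministic
hot = softmax(logits, temperature=2.0)   # flatter: more varied output
```

Low temperature concentrates mass on the top token; top-k and top-p then decide which candidates remain eligible for the final random draw.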

Named Models and Model Families

Claude
Anthropic’s family of closed-source LLMs. (Modules 2, 8)
DeepSeek
A Chinese AI company and model family; DeepSeek MoE is a widely used open MoE variant. (Module 5)
FLUX / FLUX.1
State-of-the-art open image generation model from Black Forest Labs, using the MMDiT architecture. (Module 4)
Gemini / Gemini Flash
Google’s closed-source multimodal model family. (Modules 2, 4)
Gemma / Gemma 3
Google’s open-weight model family. (Modules 4, 6)
GPT-2
OpenAI’s 2019 model (1.5B parameters), trained on WebText. Demonstrated strong zero-shot performance and was initially withheld due to safety concerns. (Modules 1, 2)
GPT-3
OpenAI’s 2020 model (175B parameters) with strong few-shot learning, accessed via API. (Module 2)
GPT-3.5 / ChatGPT
OpenAI’s instruction-tuned model launched November 2022. Reached 1 million users in 5 days. (Module 2)
InstructGPT
GPT-3 fine-tuned with RLHF to follow instructions; the key innovation that led to ChatGPT. (Module 2)
Llama / LLaMA
Meta’s open-weight model family; Llama 1 (2023, 7B–65B parameters), with Llama 2 the first version released for commercial use. (Module 2)
Midjourney
A closed-source image generation model known for high artistic quality. (Module 4)
Mistral / Mixtral
Mistral AI’s model family; Mixtral 8×7B is a popular open-source MoE model. (Module 5)
Nemotron
NVIDIA’s open-source model family, including MoE variants. (Modules 2, 5)
o1 / o3
OpenAI’s reasoning/thinking models that use hidden “thinking tokens” before producing a visible answer. (Module 6)
OLMo
A fully open-source model from AI2 (Allen Institute for AI) where both the weights and training data are publicly available. (Module 2)
Phi
Microsoft’s family of Small Language Models (SLMs), including MoE variants. (Module 5)
Qwen / Qwen2.5
Alibaba’s open-weight model family, available in various sizes. (Modules 2, 5, 6, 7)
Stable Diffusion (SD 1.5, SDXL, SD3)
Stability AI’s open-source text-to-image diffusion model, with multiple versions improving resolution and quality. (Module 4)
Switch Transformer
Google’s 2022 MoE model that simplified routing to a single expert per token. (Module 5)

APIs, Protocols, and Specifications

Chat Template
A structured format for distinguishing speakers in a conversation (system, user, assistant). Different model families use different formats (e.g., ChatML, Llama’s template). (Module 2)
ChatML
A chat template format using <|im_start|> and <|im_end|> tokens to delimit speaker turns. Used by GPT-3.5 and others. (Module 2)
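A minimal renderer for the format might look like the sketch below (the `to_chatml` function is invented for illustration; in practice a model's tokenizer applies its template for you, e.g. via Hugging Face's `apply_chat_template`):

```python
def to_chatml(messages):
    # Render OpenAI-style {"role", "content"} messages in the ChatML format.
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # open an assistant turn to cue a reply
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

The trailing unclosed assistant turn is the cue: the model completes that turn, and generation stops when it emits `<|im_end|>`.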
Function Calling / Tool Calling
An OpenAI API feature (June 2023) that allows models fine-tuned for tool use to return structured JSON specifying which function to call and with what arguments. (Module 3)
JSON Mode
An earlier OpenAI API feature (November 2023) guaranteeing that output is valid JSON, but without enforcing a specific schema. Superseded by Structured Outputs. (Module 2)
JSON-RPC 2.0
The underlying remote procedure call protocol used by MCP servers. (Module 3)
MCP (Model Context Protocol)
A standard interface for AI tools released by Anthropic in November 2024. Functions like a USB standard for AI peripherals — implementations are called “MCP servers.” Uses JSON-RPC 2.0. (Module 3)
OpenAI Chat Completions API
The dominant LLM API format, using a /chat/completions endpoint. Adopted by many providers as a de facto standard. (Module 2)
OpenAI Responses API
A newer OpenAI API that replaced the Assistants API, introduced alongside the OpenAI Agents SDK. (Module 3)
SSE (Server-Sent Events)
A unidirectional HTTP protocol used to stream tokens from a server to a client as they are generated, enabling the “typewriter effect” in chat interfaces. (Module 2)
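On the wire, each SSE event is a line beginning with `data:`, with OpenAI-style streams ending in a `[DONE]` sentinel. A minimal parser for the data lines (the `iter_sse_data` helper is invented for illustration; real clients also handle `event`, `id`, and `retry` fields and multi-line data):

```python
def iter_sse_data(lines):
    # Yield the payload of each "data:" line until the end-of-stream sentinel.
    for line in lines:
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # OpenAI's end-of-stream marker
                return
            yield payload

stream = ["data: Hel", "data: lo!", "", "data: [DONE]"]
print("".join(iter_sse_data(stream)))  # "Hello!"
```

In a real chat API each payload is a JSON chunk containing a token delta; concatenating the deltas as they arrive produces the typewriter effect.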
Structured Outputs
An API feature (OpenAI, August 2024) that guarantees model output matches a specified JSON schema exactly, using constrained decoding. (Module 2)
System Prompt
The first message in a conversation that sets the model’s role, behavior, and constraints. (Modules 2, 6)
Token Streaming
Delivering model output tokens to the client incrementally as they are generated, rather than waiting for the complete response. (Module 2)

Agents and Multi-Agent Systems

Agent Router
An agent design pattern that receives a request and hands it off to the appropriate specialized sub-agent. (Module 3)
AI Agent
An AI system that is goal-driven, autonomous, reactive, persistent, and capable of interacting with external systems and other agents. (Module 3)
AutoGen
Microsoft’s multi-agent framework, available in Python with .NET support forthcoming. (Module 3)
Code Interpreter
An agent tool that enables an AI to write and execute code on the fly within a sandboxed environment. (Module 3)
Computer Use
An agent capability allowing an AI to interact with a computer’s graphical user interface. (Module 3)
Crew.ai
A popular commercial Python framework for building multi-agent AI systems. (Module 3)
Guardrails
Safety constraints applied to agent inputs and outputs to prevent undesirable behavior. (Modules 3, 8)
Handoff
In a multi-agent system, the transfer of control from one agent to another for a specific task. (Module 3)
Human-in-the-Loop
A design pattern requiring human approval or review before an agent takes certain actions, particularly irreversible or high-stakes ones. (Modules 3, 8)
LangChain
An early and influential Python framework for building LLM applications; the basis for LangGraph. (Module 3)
LangGraph
A Python agent framework built on LangChain; one of the first frameworks supporting stateful, graph-based agent workflows. (Module 3)
Long-term Memory (Agent)
Persistent agent memory that survives beyond a single conversation. Types include factual, episodic, and procedural memory. (Module 3)
mem0
An open-source library for implementing long-term agent memory. (Module 3)
Microsoft Semantic Kernel
Microsoft’s agent SDK supporting Python, .NET, and Java. (Module 3)
OpenAI Agents SDK
A framework announced March 2025 for building multi-agent systems in Python and TypeScript. Supports function calling, handoffs, tracing, and session management. (Module 3)
Orchestrator
An agent design pattern that uses other agents as tools, delegating subtasks and aggregating results. (Module 3)
Parallel Agents
An agent design pattern that calls multiple agents simultaneously and aggregates their results. (Module 3)
Session
The OpenAI Agents SDK’s mechanism for maintaining short-term memory (conversation history) across agent calls. (Module 3)
Short-term Memory (Agent)
Stores and retrieves the current conversation thread; typically implemented as a session in agent SDKs. (Module 3)
Tracing
Built-in recording of agent generations, tool calls, handoffs, and other events for debugging and auditing purposes. (Module 3)

RAG and Context Techniques

Chunking
Splitting large documents into smaller pieces before embedding for use in RAG. Strategies range from fixed-size splits to sentence-aware and semantic chunking. (Module 6)
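The simplest strategy, fixed-size chunks with overlap, can be sketched in a few lines (the `chunk_text` function and its default sizes are invented for illustration; production systems usually chunk by tokens or sentences rather than characters):

```python
def chunk_text(text, size=200, overlap=50):
    # Fixed-size character chunks with overlap, so a sentence cut at one
    # boundary still appears whole in a neighboring chunk.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(500))  # stand-in for a document
chunks = chunk_text(text, size=200, overlap=50)  # 3 chunks of 200 characters
```

Each chunk is then embedded and stored in the vector database; the overlap trades a little storage for better retrieval of content near chunk boundaries.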
Context Injection
Taking retrieved information and inserting it into the model’s system prompt before making an API call. The “generation” step in RAG. (Module 6)
FAISS
Meta’s fast in-memory vector index library, widely used for similarity search in RAG systems. (Module 6)
Milvus
An open-source vector database capable of handling billions of embeddings at scale. (Module 6)
pgvector
A PostgreSQL extension for storing and querying vector embeddings directly in a Postgres database. (Module 6)
Pinecone
A popular managed vector database offered as a cloud service. (Module 6)
Qdrant
An open-source dedicated vector database written in Rust. (Module 6)
RAG (Retrieval-Augmented Generation)
A technique to reduce hallucinations by retrieving relevant external documents and injecting them into the model’s context before generating a response. Term coined in 2020. (Modules 3, 6, 8)
Semantic Chunking
A high-quality chunking strategy that groups sentences by embedding similarity and splits the text where the meaning changes significantly. (Module 6)
sqlite-vec
A SQLite extension that adds vector embedding storage and search capabilities. (Module 6)
Text-to-SQL
A technique where the model converts a natural language question into a SQL query to retrieve structured data. A form of context injection. (Module 6)
Vector Store / Vector Database
A database that stores vector embeddings and enables efficient similarity search. The retrieval component in a RAG pipeline. (Modules 3, 6)

Quantization and Model Formats

bf16 (bfloat16)
A 16-bit floating-point format (“brain float”) that keeps FP32’s exponent range at reduced precision; used in training and for LoRA adapter matrices in QLoRA. (Module 7)
FP16 / FP32
16-bit and 32-bit floating-point formats. Higher precision, higher memory usage. (Modules 5, 7)
GGML (Georgi Gerganov Machine Learning)
A C/C++ library and custom binary format for CPU-based LLM inference that helped democratize local model access. Superseded by GGUF. (Module 5)
GGUF (GPT-Generated Unified Format)
The replacement for GGML, adding extensibility, better metadata, single-file architecture, and support for offloading selected layers to GPU or NPU. The standard format for llama.cpp-based inference. (Module 5)
GPTQ (GPT Quantization)
One of the first widely adopted methods for aggressive 4-bit post-training quantization. CUDA-only; distributed via Hugging Face. (Module 5)
INT8
An 8-bit integer quantization format used to reduce model memory footprint. (Module 5)
K-Quant Strategy
A mixed quantization strategy in GGUF where different model layers are quantized at different bit depths based on their sensitivity. Common variants include Q4_K_M and Q5_K_S. (Module 5)
NF4 Format
4-bit NormalFloat format used to store base model weights in QLoRA, reducing memory requirements during fine-tuning. (Module 7)
ONNX (Open Neural Network eXchange)
A model interchange format created by Microsoft and Facebook in 2017 for portability between ML frameworks. Uses protobuf serialization. (Module 5)
Quantization
The process of reducing the numerical precision of model weights (e.g., from 16-bit floats to 4-bit integers) to reduce memory usage and speed up inference, with a modest accuracy tradeoff. (Modules 5, 7)
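A minimal sketch of symmetric linear quantization to 8-bit integers (function names invented for illustration; real schemes like NF4 or K-quants use per-block scales and non-uniform levels): pick a scale so the largest weight maps to 127, round every weight to an integer, and multiply back by the scale to reconstruct.

```python
def quantize_int8(weights):
    # Symmetric linear quantization: map floats onto integers in [-127, 127].
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.31, -1.27, 0.05, 0.88]
q, scale = quantize_int8(w)        # stored as small integers + one float scale
w_hat = dequantize(q, scale)       # approximate reconstruction at inference
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The integers take a quarter of the memory of FP32 weights, and the worst-case rounding error is half a quantization step (scale / 2), which is the "modest accuracy tradeoff" in practice.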
Safetensors
A tensor storage format used by Hugging Face and Apple MLX; designed to be safe and fast to load. (Module 5)

Hardware and Compute

ANE (Apple Neural Engine)
Apple’s on-device NPU for accelerating CoreML workloads on iPhone, iPad, and Apple Silicon Macs. (Module 5)
CUDA (Compute Unified Device Architecture)
NVIDIA’s GPU programming platform, launched in 2006. The de facto standard for deep learning, including libraries like cuBLAS and cuDNN. (Module 5)
DGX Spark
An NVIDIA desktop workstation with a GB10 chip and 128GB of unified memory, launched in 2025. (Module 5)
GPU (Graphics Processing Unit)
A massively parallel processor essential for training and inference of neural networks. (Modules 0, 1, 5)
GPGPU (General Purpose GPU)
Using GPU hardware for non-graphics computational workloads such as machine learning. Enabled by CUDA. (Module 5)
Metal / MPS (Metal Performance Shaders)
Apple’s low-level GPU API. MPS added optimized primitives for neural network operations in 2017. (Module 5)
MLX
Apple’s open-source ML framework (released December 2023) designed for Apple Silicon. Provides a NumPy/PyTorch-like Python API using the Metal GPU backend. (Module 5)
NPU (Neural Processing Unit)
A specialized processor optimized for neural network operations at lower power consumption than a GPU. Common in smartphones and edge devices. (Module 5)
NVLink
NVIDIA’s high-bandwidth interconnect used to connect multiple GPUs in a server or workstation. (Module 5)
ROCm (Radeon Open Compute)
AMD’s open-source alternative to CUDA, including rocBLAS. Currently Linux-only. (Module 5)
SIMD (Single Instruction Multiple Data)
A CPU instruction set feature for performing the same operation on multiple data elements simultaneously. Used by llama.cpp for CPU inference optimization. (Module 5)
SoC (System on a Chip)
An integrated circuit combining CPU, GPU, and other components on a single chip. Apple Silicon is a prominent example. (Module 5)
TFLOPS (Tera Floating-Point Operations Per Second)
A measure of a processor’s compute performance. 1 TFLOPS = 1 trillion floating-point operations per second, typically quoted at FP32 precision. (Module 5)
TOPS (Tera Operations Per Second)
A measure of processor performance for integer or mixed-precision operations. Common for comparing NPUs. (Module 5)
TPU (Tensor Processing Unit)
Google’s custom AI accelerator, available for free use in Google Colab. (Modules 1, 5)
Unified Memory
A memory architecture shared between the CPU and GPU on the same chip (e.g., Apple Silicon, NVIDIA DGX Spark). Enables larger models than discrete VRAM but at lower bandwidth. (Module 5)
VRAM (Video RAM)
The dedicated memory on a discrete GPU. A key constraint for running large models — the model must generally fit within available VRAM. (Module 5)
WebAssembly (WASM)
A portable binary instruction format enabling near-native performance in web browsers; used to run small ML models client-side. (Module 5)
WebGPU
A web standard for GPU-accelerated computation in the browser. Used by WebLLM and Transformers.js for in-browser LLM inference. (Module 5)

Inference Frameworks and Tools

LiteLLM
An open-source tool providing a unified OpenAI-compatible API interface across multiple LLM providers. (Module 2)
llama.cpp
A C/C++ library for CPU and GPU inference of GGUF models. Includes a CLI, web UI, and OpenAI-compatible API server. Released March 2023 by Georgi Gerganov. (Module 5)
llama-cpp-python
A Python binding for llama.cpp with an OpenAI-compatible API. (Module 5)
LLamaSharp
A C# binding for llama.cpp, installed via NuGet. Supports CPU, CUDA, and Vulkan backends. (Module 5)
LM Studio
A desktop GUI application that wraps llama.cpp, providing a model browser, built-in chat interface, and a local API server. (Modules 5, 7)
Ollama
A simple CLI tool wrapping llama.cpp, using a Modelfile for configuration. Provides a curated model library. (Module 5)
vLLM
A high-performance, OpenAI-compatible LLM inference server optimized for production deployments. (Module 2)
WebLLM
A JavaScript library for in-browser LLM inference using WebGPU. Requires models in MLC format. (Module 5)
Wllama
A JavaScript library for in-browser CPU-only inference using GGUF models. (Module 5)
Transformers.js
Hugging Face’s JavaScript equivalent of the transformers library. Uses ONNX format and runs models directly in the browser. (Module 5)

Prompt Engineering

Chain-of-Thought (CoT)
A prompt engineering technique that asks the model to “think step by step” before answering. Shown to dramatically improve reasoning and math performance (Google, 2022). (Modules 6, 8)
Few-shot Learning / Examples
Providing 2–5 input/output examples in the prompt to guide the model toward a desired format or behavior. (Modules 2, 6)
Negative Samples (Prompting)
Including examples of what the model should not do alongside positive examples in the prompt. (Module 6)
Prompt Engineering
The practice of carefully crafting model inputs to guide outputs. Techniques include few-shot examples, chain-of-thought prompting, role assignment, and negative samples. (Module 6)
Role / Persona Assignment
Adding a role or persona to the system prompt to guide the model’s tone, style, and perspective. (Module 6)
Zero-shot
The model’s ability to perform a task with no examples provided in the prompt, relying entirely on knowledge from pretraining. (Module 1)

Reasoning Models

Reasoning / Thinking Models
Models fine-tuned to produce a “thinking” phase before their final answer, giving them a scratch space for exploration and self-correction. (Module 6)
Thinking Tokens
Tokens the model uses to reason before producing its visible answer. OpenAI’s o1/o3 use hidden thinking tokens; many open-weight models use visible <think>/</think> delimiters. (Module 6)

Evaluation

Dataset Contamination
When benchmark questions appear in a model’s training data, inflating benchmark scores through memorization rather than genuine capability. (Module 6)
Evals (Evaluations)
Benchmarks and test suites used to measure model capabilities, track progress over time, and detect regressions. (Module 6)
GPQA (Graduate-Level Google-Proof Q&A)
A PhD-level scientific reasoning benchmark in biology, physics, and chemistry. Designed to be unsolvable via web search alone. (Module 6)
HLE (Humanity’s Last Exam)
A benchmark of 2,500 expert-level questions requiring multimodal, multi-step reasoning. Created by CAIS and Scale AI. (Module 6)
LLM as a Judge
Using a separate LLM to evaluate the outputs of another LLM for quality, safety, or accuracy. (Module 8)
MMLU (Massive Multitask Language Understanding)
A multi-domain multiple-choice benchmark covering STEM, humanities, and more. Published in 2021; now largely saturated by top models. (Module 6)
MMLU-Pro
A harder version of MMLU with 12,000 questions across 14 subjects. Released June 2024 to address model saturation of the original MMLU. (Module 6)
Red-Teaming
Adversarial testing of models to systematically find failure modes, biases, and safety vulnerabilities. (Modules 6, 8)
SOTA (State of the Art)
The best-performing result on a given benchmark or task at a given point in time. (Module 1)
SWE-Bench
A benchmark testing AI ability to resolve real GitHub issues from open-source Python repositories. Created in 2023 by the Princeton NLP group. (Module 6)
W&B / Weights & Biases
An ML experiment tracking platform for monitoring training metrics (loss, accuracy, GPU utilization) across runs. (Module 7)

Multimodal and Image Generation Concepts

AnimateDiff
A video generation extension of the diffusion model framework. (Module 4)
ControlNet
An architecture (Stanford, February 2023) that adds spatial control to diffusion models by creating a trainable copy of the U-Net encoder to accept conditioning inputs (depth maps, pose skeletons, edge maps) while keeping original model weights frozen. (Module 4)
Depth Map
An image where pixel values represent distance from the viewer. Used as a spatial control input for image generation. (Module 4)
Denoising / Reverse Diffusion
The inference phase of a diffusion model: starting from pure random noise and iteratively removing noise, guided by a text prompt, to produce a coherent image. (Module 4)
Forward Diffusion Process
The training phase of a diffusion model: progressively adding random noise to real images. The model learns to predict what noise was added at each step. (Module 4)
Image-to-Image
A model capability that generates a modified image from an existing image and a text prompt, using partial denoising of the source image. (Module 4)
Inpainting
Filling in missing or masked regions of an image in a realistic way, steered by the surrounding context and a text prompt. (Module 4)
OpenPose
A human pose estimation model whose output skeleton can be used as a conditioning input for ControlNet. (Module 4)
Outpainting
Extending an image beyond its original borders by treating the new region as a masked area and applying inpainting. (Module 4)
Prompt Upsampling
An image generation feature that runs a short prompt through an LLM to make it more detailed and descriptive before passing it to the image model. (Module 4)
Safety Classifier / Safety Tolerance
Separate classifier models that run alongside generative image models to filter harmful prompts (input filtering) or flag unsafe generated images (output filtering). (Module 4)
Super Resolution
An image-to-image task of increasing the resolution and detail of an existing image. (Module 4)
Style Transfer
An image-to-image task of recreating an image in a different artistic style. (Module 4)
Text-to-Image
A model capability that generates an image from a natural language text prompt using a diffusion process. (Module 4)

Tools, Platforms, and Services

Google Colab
A cloud-based Jupyter notebook environment with free GPU and TPU access. (Module 1)
Gradio
A Python library for rapidly building web UIs for ML demos. Supports text, images, audio, and streaming. Acquired by Hugging Face in 2021. (Modules 2, 3)
Hugging Face
The central platform for sharing AI models, datasets, and demos. Often described as “GitHub for AI models.” (Modules 2–7)
Hugging Face Datasets
Hugging Face’s repository of public training datasets, stored in Parquet format. Used for uploading fine-tuning data. (Module 7)
Hugging Face Spaces
Free cloud hosting for ML demos, supporting Gradio, Streamlit, and Docker. (Module 3)
Hugging Face Transformers Library
An open-source Python library providing unified access to thousands of pre-trained transformer models across PyTorch, TensorFlow, and JAX. (Module 2)
HF Pipelines
A high-level abstraction in the Hugging Face Transformers library that simplifies model usage with a standardized API across task types. (Module 4)
JAX
Google’s numerical computing library, used as an alternative backend for Hugging Face Transformers. (Module 2)
OpenRouter
An inference provider offering a unified OpenAI-compatible API to hundreds of models from OpenAI, Anthropic, Google, Meta, and others. Pay-per-call pricing. (Module 2)
PyTorch
The dominant deep learning framework for research and production, used by most Hugging Face models. (Modules 2, 5)
Replicate
A model hosting platform focused on image and video models. Offers pay-per-call pricing and supports fine-tuning. (Module 4)
TensorFlow
Google’s deep learning framework; an alternative to PyTorch. (Modules 2, 5)
tiktoken
OpenAI’s tokenization library for estimating token counts. (Module 2)

Ethics, Safety, and Intellectual Property

Adversarial Optimization (Jailbreak)
A jailbreaking technique that uses gradient-based optimization to find token sequences that reliably bypass a model’s safety guardrails. (Module 8)
Bias and Fairness
The reflection of societal biases present in training data into model outputs. A mathematical inevitability when training data reflects historical inequities. (Module 8)
C2PA (Coalition for Content Provenance and Authenticity)
An organization developing open standards for cryptographically signing media at the point of creation to verify its authenticity and origin. (Module 8)
Capability Bounding
Limiting a model’s scope and capabilities via fine-tuning, alignment, or system prompting to prevent unintended behaviors. (Module 8)
Confidence Calibration
Training models to express appropriate uncertainty rather than stating everything with equal, unwarranted conviction. (Module 8)
Copyright / Fair Use
The ongoing legal question of whether training AI models on copyrighted works constitutes fair use or infringement. (Module 8)
DAN (Do Anything Now)
An early ChatGPT jailbreak prompt that triggered an alter-ego mode, bypassing safety restrictions. (Module 8)
Deepfake
AI-generated synthetic media — text, image, video, or audio — used to impersonate a real person or deceive an audience. (Module 8)
EU AI Act
European regulation governing AI systems, including provisions for high-risk domain oversight, environmental documentation, and mandatory labeling of deepfakes. (Module 8)
Explainability / Black Box
The difficulty of understanding or auditing a neural network’s internal reasoning. Raises ethical concerns in high-stakes domains like healthcare, law, and defense. (Module 8)
Fiction Framing Attack
A jailbreaking technique that wraps a harmful request inside a fictional storytelling context to bypass safety guardrails. (Module 8)
Hallucination
When a model generates plausible-sounding but factually incorrect information. Not a traditional software bug but a consequence of stochastic next-token prediction. (Modules 1, 6, 7)
Input / Output Filtering
Safety classifier layers that analyze prompts before they reach the model and/or screen model responses before they are returned to the user. (Modules 4, 8)
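A production safety classifier is a trained model, not a keyword list, but a toy input filter (the blocklist patterns here are hypothetical) shows where input filtering sits in the request flow:

```python
import re

# Hypothetical patterns for illustration only; real systems use trained
# classifiers that score the whole prompt, not regex matching.
BLOCKLIST = [r"\bmake a bomb\b", r"\bsteal credit card numbers\b"]

def filter_prompt(prompt: str):
    """Return (allowed, payload): block before the prompt reaches the model."""
    for pattern in BLOCKLIST:
        if re.search(pattern, prompt, re.IGNORECASE):
            return False, "Prompt blocked by input filter."
    return True, prompt

allowed, payload = filter_prompt("Write a haiku about spring")
# allowed is True, so the prompt would pass through to the model
```

Output filtering works symmetrically: the model's response is scored by a classifier before being returned to the user.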
Jailbreaking
Attempts to bypass a model’s safety guardrails and system prompt constraints to produce disallowed content. (Module 8)
Model Card
A documentation file (typically README.md) accompanying a Hugging Face model, describing its training process, benchmark results, intended uses, limitations, and risks. (Module 7)
Open Weights
A model distribution where the trained weight files are publicly downloadable, but the training data and full training code are not. Enables local deployment and fine-tuning. (Module 2)
Prompt Injection
An attack where malicious content in the model’s input attempts to override the system prompt or hijack the model’s instructions. (Module 8)
Vibe Hacking
Anthropic’s term for AI-assisted automation of large portions of a cybercrime campaign, lowering the barrier for sophisticated attacks. (Module 8)
Voice Cloning
An AI technique that replicates a person’s voice from a small audio sample, raising significant impersonation and consent concerns. (Module 8)

Data and Training Infrastructure

AdamW
A variant of the Adam optimizer with decoupled weight decay, commonly used for training and fine-tuning LLMs. (Module 7)
Adapter
The small set of trained LoRA matrices (A and B) that encode a behavioral change. Can be kept as a separate file or merged into the base model weights. (Module 7)
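The merge step can be sketched in plain Python on toy 2x2 matrices (the `alpha` and `r` values and matrix sizes are illustrative; real adapters hold tensors for many layers):

```python
def matmul(X, Y):
    """Plain-list matrix multiply, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge_lora(W, A, B, alpha=16, r=2):
    """Merge a LoRA adapter into base weights: W' = W + (alpha / r) * (B @ A)."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> full-size update
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weights (2x2)
A = [[0.1, 0.0], [0.0, 0.1]]   # low-rank factor, r x d_in
B = [[0.0, 0.0], [0.0, 0.0]]   # d_out x r; B is initialized to zero in LoRA
merged = merge_lora(W, A, B)   # B is all zeros, so merged equals W
```

Because B starts at zero, a freshly initialized adapter changes nothing; training moves A and B so that B @ A encodes the behavioral change.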
Checkpoint
A saved snapshot of model weights at a point during training, allowing training to be resumed or a specific point in training to be evaluated. (Module 7)
JSONL (JSON Lines)
A file format where each line is a valid JSON object. The standard format for fine-tuning datasets. (Module 6)
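A minimal sketch using only the standard library, assuming chat-style records in the common OpenAI-style `messages` layout (the filename and example contents are illustrative):

```python
import json

# Each line is one self-contained JSON object -- one training example per line.
examples = [
    {"messages": [{"role": "user", "content": "What is 2+2?"},
                  {"role": "assistant", "content": "4"}]},
    {"messages": [{"role": "user", "content": "Capital of France?"},
                  {"role": "assistant", "content": "Paris"}]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Reading back: parse one line at a time; no need to load the whole file.
with open("train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

Line-at-a-time parsing is what makes JSONL practical for large datasets: tools can stream, shuffle, and split examples without holding everything in memory.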
Model Weights
The numerical parameters of a trained model — the “knowledge” encoded during the training process. (Modules 2, 7)
Parameters
The individual numerical values in a model’s weight matrices. Model size is commonly expressed in billions of parameters (e.g., 7B, 70B). (Modules 1, 2, 5)
Parquet
A columnar data storage format used by Hugging Face Datasets. (Module 7)
Test Set
A held-out portion of data (~10–15%) used only after training is complete to provide an unbiased final performance measure. (Module 6)
Training Set
The largest portion of data (~70–80%) that the model directly learns from during fine-tuning. (Module 6)
Validation Set
A held-out portion of data (~10–15%) used during training to monitor generalization and detect overfitting. (Module 6)
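A toy 70/15/15 split with the standard library (the proportions follow the ranges above; the seed and function name are illustrative):

```python
import random

def split_dataset(examples, train=0.70, val=0.15, seed=42):
    """Shuffle, then carve off training, validation, and test portions."""
    data = list(examples)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (data[:n_train],                  # training set: model learns from this
            data[n_train:n_train + n_val],   # validation set: monitored during training
            data[n_train + n_val:])          # test set: touched only once, at the end

train_set, val_set, test_set = split_dataset(range(100))
# 70 / 15 / 15 examples; no example appears in more than one split
```

Keeping the three splits disjoint is the whole point: any overlap lets the model "see" its exam questions during training, inflating the measured performance.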
WebText
The dataset of 8 million web pages (~40GB of text) used to train GPT-2. (Module 1)