Glossary

Key terms, concepts, and acronyms from CS-394/594: How Generative AI Works.

Architectures and Model Types

ALM (Audio-Language Model)
A multimodal model that processes both audio and text together. (Module 4)
Autoregressive Generation
A text generation approach that produces one token at a time, feeding each output back as input for the next prediction. Because each next token is sampled from a probability distribution, the same prompt can produce different outputs. (Module 1)
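As a toy illustration of the autoregressive loop, the sketch below uses a hand-built next-token table in place of a real model (the `PROBS` table and `generate` function are invented for this example). The key point is the feedback: each sampled token is appended to the context and conditions the next prediction.

```python
import random

# Toy "model": next-token probabilities keyed on the previous token alone.
# A real LLM conditions on the entire sequence so far.
PROBS = {
    "the": [("cat", 0.6), ("dog", 0.4)],
    "cat": [("sat", 0.7), ("ran", 0.3)],
    "dog": [("sat", 0.5), ("ran", 0.5)],
    "sat": [("<eos>", 1.0)],
    "ran": [("<eos>", 1.0)],
}

def generate(prompt, max_tokens=10, seed=None):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        candidates = PROBS[tokens[-1]]                # predict from context
        words, weights = zip(*candidates)
        nxt = rng.choices(words, weights=weights)[0]  # sample one token
        if nxt == "<eos>":
            break
        tokens.append(nxt)                            # feed output back as input
    return tokens

print(generate(["the"], seed=0))
```

Because sampling is random, different seeds give different continuations of the same prompt, which is exactly why identical prompts to an LLM can yield different answers.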
CLIP
OpenAI’s vision encoder trained on 400 million image-text pairs. Creates a shared embedding space between images and text, making it the foundation of most VLMs. (Module 4)
CNN (Convolutional Neural Network)
A classic neural network architecture for image tasks, based on convolution operations. Historically used for image classification before Vision Transformers became dominant. (Module 4)
Decoder-only Architecture
A transformer variant where self-attention is causal/masked — tokens can only attend to previous tokens, not future ones. The basis of GPT-style models. (Module 1)
DINO / DINOv2
Meta’s self-supervised vision transformer (“Self DIstillation with NO Labels”), trained on 142 million images without labels. (Module 4)
Diffusion Model
An image generation architecture inspired by thermodynamics. During training, noise is progressively added to images. During inference, the model starts from random noise and iteratively removes it, guided by a text prompt. (Module 4)
Encoder-Decoder (Seq2Seq)
A transformer variant with both an encoder (which generates contextual representations via self-attention) and a decoder (which generates output tokens one at a time, using cross-attention to the encoder’s output). Used in translation models. (Module 1)
FastVLM / FastViTHD
Apple’s efficient Vision Language Model combining transformers and convolutional layers, optimized for on-device real-time performance. (Module 4)
GPT (Generative Pre-trained Transformer)
A decoder-only transformer architecture pre-trained on next-token prediction. The basis for models like ChatGPT. (Modules 1, 2)
LLaVA
An influential open-source Vision Language Model developed by University of Wisconsin-Madison and Microsoft Research. (Module 4)
LLM (Large Language Model)
A large-scale neural network trained on massive text corpora to understand and generate human language. The primary focus of this course. (Modules 0–8)
MMDiT (Multimodal Diffusion Transformer)
The architecture used by FLUX image generation models. Processes text and image tokens together in a unified transformer, replacing the U-Net architecture used in Stable Diffusion. (Module 4)
MoE (Mixture of Experts)
A neural network architecture with multiple “expert” sub-networks and a routing layer that activates only a subset of experts for each input token. Enables larger effective model size while keeping active compute low. (Module 5)
RNN (Recurrent Neural Network)
An older sequence-modeling architecture that processes tokens one at a time while carrying a hidden state; superseded by the Transformer for NLP tasks. (Module 1)
SLM (Small Language Model)
Smaller language models designed to run on local or consumer-grade hardware. (Modules 0, 5, 6)
Swin Transformer
Microsoft’s vision transformer using a “shifted window” attention strategy; excels at dense prediction tasks like object detection and segmentation. (Module 4)
Transformer
The neural network architecture introduced in the 2017 paper “Attention Is All You Need.” Eliminated the need for RNNs in sequence tasks by using attention mechanisms. The foundation of virtually all modern LLMs. (Modules 1, 2, 4)
U-Net
A neural network architecture with an encoder-decoder structure used in Stable Diffusion for image generation. (Module 4)
ViT (Vision Transformer)
A transformer applied to images by dividing them into 16×16 patches and treating each patch as a token. Introduced in “An Image is Worth 16×16 Words.” (Module 4)
VLM (Vision Language Model)
A multimodal model combining a vision encoder, an adapter/projector layer, and a language model. Enables image-and-text-to-text tasks. (Modules 4, 6)

Training Concepts and Techniques

Alignment
Post-training refinement that shapes a model toward preferred behaviors and values. Includes techniques like RLHF and Constitutional AI. (Modules 7, 8)
Backpropagation
The algorithm for computing gradients through a neural network and updating model weights to reduce loss. (Module 7)
Batch Size
The number of training examples processed together in a single forward/backward pass. (Module 7)
Constitutional AI
Anthropic’s alignment approach that uses a written set of principles to guide model behavior during training. (Module 8)
Data Poisoning / Watermarking (Glaze / Nightshade)
Techniques developed at UChicago that add imperceptible adversarial perturbations to images: Glaze cloaks an artist’s style so models cannot imitate it, while Nightshade poisons training data to degrade models trained on it. (Module 8)
Distillation (Knowledge Distillation)
Training a smaller model using the outputs of a larger, more capable model. Also used maliciously to extract capabilities from commercial models. (Modules 6, 8)
Epoch
One complete pass through the entire training dataset. (Module 7)
Expert Collapse
A failure mode in MoE training where a few experts handle nearly all tokens and the rest go largely unused. (Module 5)
Fine-tuning
Continuing to train a pre-trained model on a smaller, curated dataset to adapt it to a specific task, style, or behavior. (Modules 6, 7)
FrankenMoE / MoErge
A community approach to creating MoE models by combining the FFN layers of multiple specialized models (e.g., math, coding, chat) into a single model with a new router network. (Module 5)
Gradient Accumulation
Accumulating gradients over multiple mini-batches before taking an optimizer step. A memory-efficient way to simulate a larger effective batch size. (Module 7)
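A minimal numeric sketch of why accumulation works, using a one-weight linear model with squared-error loss (the `grad` function and data are invented for illustration): averaging the gradients of equal-sized micro-batches gives exactly the full-batch gradient, so the optimizer step is the same as with the larger batch.

```python
def grad(w, batch):
    # Gradient of mean squared error for the model y_hat = w * x.
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]

# One big batch:
g_full = grad(w, data)

# Two equal micro-batches, gradients accumulated then averaged:
g_acc = (grad(w, data[:2]) + grad(w, data[2:])) / 2

print(g_full, g_acc)  # identical: accumulation simulates the larger batch
```

In a real training loop the framework sums gradients across `loss.backward()` calls and you divide the loss (or gradients) by the number of accumulation steps before `optimizer.step()`.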
Instruction-tuning
Fine-tuning a base model on large datasets of question/answer pairs and task-completion examples to make it follow instructions and behave as a helpful assistant. (Module 2)
Learning Rate
A hyperparameter controlling how large a step the optimizer takes when updating weights. Too high causes instability; too low causes slow convergence. (Module 7)
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that freezes the base model weights and introduces two small trainable matrices (A and B) whose product captures the desired behavioral change. Results in a small, portable “adapter.” (Module 7)
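A toy sketch of the low-rank idea (shapes and values invented for illustration; real adapters use rank 8–64 on matrices with thousands of rows and columns): instead of training a full d×d update to a frozen weight matrix, LoRA trains two thin matrices whose product is the update.

```python
# Frozen 4x4 weight matrix gets a rank-1 update delta_W = A @ B.
r, d = 1, 4
A = [[1.0], [0.0], [0.5], [0.0]]   # d x r (trainable)
B = [[0.2, 0.0, -0.1, 0.4]]        # r x d (trainable)

delta_W = [[sum(A[i][k] * B[k][j] for k in range(r)) for j in range(d)]
           for i in range(d)]

full_params = d * d          # 16 values if we trained W directly
lora_params = d * r + r * d  # 8 trainable values instead
```

The savings look modest at d=4, but at d=4096 and r=8 the adapter trains 65,536 values per matrix instead of 16.8 million, which is why the resulting adapter file is so small and portable.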
Overfitting
When a model memorizes training data rather than learning to generalize. Detected by validation loss increasing while training loss continues to decrease. (Module 7)
Pretraining
The initial large-scale training run that teaches the model language by predicting the next token over a massive text corpus. Produces the “base model.” (Modules 2, 7)
QLoRA (Quantized LoRA)
A variant of LoRA that quantizes the base model weights to 4-bit (NF4 format) to reduce memory usage during fine-tuning, while keeping the adapter matrices at higher precision (bf16). (Module 7)
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human raters rank different model responses, training a reward model that guides further fine-tuning. Used to create InstructGPT and ChatGPT. (Modules 2, 3)
Supervised Fine-Tuning (SFT)
Fine-tuning a pre-trained model on a curated labeled dataset. The standard first step in aligning a base model to follow instructions. (Modules 2, 6, 7)
Synthetic Data / Data Distillation
Generating training examples by prompting a more capable model (e.g., “Generate 100 examples of a student asking a teacher a geography question”). (Module 6)
Upcycling
A MoE training strategy that starts from an existing dense model, replicates its layers into multiple experts, and adds router networks. Faster and more stable than training a MoE from scratch. (Module 5)
Validation Loss
Model performance measured on a held-out dataset not used during training. Used to detect overfitting. (Module 7)

Embeddings, Tokenization, and Attention

Attention Mechanism
The core mechanism in transformers that allows each token to weigh the importance of every other token in the sequence when building its representation. (Module 1)
BPE (Byte Pair Encoding)
A tokenization algorithm that splits words into frequently occurring subword units. Originally a data compression algorithm, adapted for neural machine translation in 2016. (Module 1)
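One BPE merge step can be sketched in a few lines (the tiny corpus and helper names are invented for illustration): count adjacent symbol pairs across the corpus, then fuse every occurrence of the most frequent pair into a single symbol. Training repeats this until the vocabulary reaches the desired size.

```python
from collections import Counter

def most_frequent_pair(words):
    # words: list of symbol sequences; count adjacent symbol pairs.
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)      # ("l", "o") appears in all three words
corpus = merge_pair(corpus, pair)      # "low" is now tokenized as ["lo", "w"]
```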
Causal / Masked Self-Attention
A self-attention variant where tokens can only attend to previous tokens, not future ones. Used in GPT-style decoder-only models. (Module 1)
CBOW (Continuous Bag-of-Words)
A Word2Vec training method that predicts a center word from its surrounding context words. (Module 1)
Contextual Embeddings
Word representations that change based on surrounding context, unlike the static embeddings produced by Word2Vec. Created during the transformer’s training process. (Module 1)
Cosine Similarity
A common measure of similarity between two vectors, used to find related embeddings in vector search. (Modules 1, 6)
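The formula is the dot product of the two vectors divided by the product of their lengths, giving 1.0 for parallel vectors and 0.0 for orthogonal ones. A minimal pure-Python version:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); math.hypot computes the Euclidean norm.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel: 1.0 (up to rounding)
```

Vector search libraries compute the same quantity, usually after normalizing embeddings so it reduces to a plain dot product.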
Cross-Attention
An attention mechanism in the decoder that attends to the encoder’s output representations, allowing the decoder to “look at” the source input. (Module 1)
One-hot Encoding
An early NLP word representation method where each word is a sparse binary vector with a single 1. Replaced by dense embeddings. (Module 1)
Self-Attention
An attention mechanism where each token in a sequence attends to all other tokens in the same sequence. The primary building block of transformers. (Module 1)
Sentence Transformer
A model that creates embeddings for entire sentences rather than individual words. Used for semantic search and RAG retrieval (e.g., all-MiniLM-L6-v2 with a 384-dimensional vector space). (Module 6)
Skip-gram
A Word2Vec training method that predicts surrounding context words from a center word. (Module 1)
Token
The basic unit of input and output for a language model — a subword piece of text. API costs are measured in tokens. (Modules 1, 2)
Tokenization
The process of converting raw text into a sequence of numerical tokens for model input. Different models use different tokenizers. (Modules 1, 2)
Vector Arithmetic
Mathematical operations on word embeddings that capture semantic relationships (e.g., king − man + woman ≈ queen). (Module 1)
Vector Space
The multi-dimensional mathematical space in which embeddings are placed, where similar concepts are geometrically close. (Module 1)
Word Embeddings
Dense numerical representations of words in a multi-dimensional space where semantically similar words are geometrically close. (Module 1)
Word2Vec
A 2013 Google Research technique for learning word embeddings using neural networks; introduced the Skip-gram and CBOW training methods. (Module 1)

Sampling and Generation Parameters

Constrained Decoding
A technique where the next token is dynamically filtered to only allow tokens that keep the output in a valid state (e.g., valid JSON). The mechanism behind Structured Outputs. (Module 2)
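A drastically simplified sketch of the masking step (the tiny vocabulary and the regex "grammar" are invented for illustration; a real system compiles the full JSON schema into a token mask at every decoding step): given the output so far, keep only the candidate tokens that leave the output a valid prefix of the grammar.

```python
import re

# Toy vocabulary and a toy grammar: output must stay a prefix of a
# quoted string containing only "a" and "b", like "ab".
VOCAB = ['"', 'a', 'b', '}', '{', '1']
PATTERN = re.compile(r'"[ab]*"?')

def allowed_tokens(prefix):
    # A token is allowed if the extended output still matches the grammar.
    return [t for t in VOCAB if PATTERN.fullmatch(prefix + t)]

print(allowed_tokens(''))    # ['"']           -- must open the string first
print(allowed_tokens('"a'))  # ['"', 'a', 'b'] -- close the string or continue it
```

At each step the model's probabilities for disallowed tokens are zeroed out before sampling, so the output can never leave the valid state.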
Context Window
The maximum number of tokens a model can process in a single request, including both the input (conversation history, system prompt) and the generated response. (Modules 1, 2)
Negative Prompt
In image generation, text telling the model what to avoid in the output. Common in Stable Diffusion workflows. (Module 4)
Seed
An integer used to initialize image generation from a specific random noise state. Using the same seed with the same prompt reproduces the same output. (Module 4)
Strength Parameter
In image-to-image generation (range 0.0–1.0), controls how much the original image influences the output versus the new prompt. (Module 4)
Temperature
A parameter controlling randomness in token generation (range 0.0–1.0+). Lower values produce more deterministic outputs; higher values produce more creative or varied outputs. (Modules 1, 2)
top_k
A sampling strategy that restricts the next token candidates to the top k tokens by probability. (Modules 1, 2)
top_p (Nucleus Sampling)
A sampling strategy that keeps only the smallest set of top tokens whose cumulative probability exceeds a threshold p, dynamically adjusting the candidate pool. (Modules 1, 2)
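The three parameters above can be sketched as transforms on a probability distribution (the helper names `softmax`, `top_k_filter`, and `top_p_filter` are invented for this example): temperature rescales the logits before the softmax, while top-k and top-p zero out and renormalize the tail.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    # Keep the k largest probabilities (ties at the cutoff are kept), renormalize.
    cutoff = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Keep the smallest set of top tokens whose cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [0.0] * len(probs), 0.0
    for i in order:
        kept[i] = probs[i]
        cum += probs[i]
        if cum >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]

logits = [2.0, 1.0, 0.5, -1.0]
cold = softmax(logits, temperature=0.2)  # sharper: near-deterministic
hot = softmax(logits, temperature=2.0)   # flatter: more varied output
```

Low temperature concentrates mass on the top token; top-k and top-p then decide which candidates remain eligible for the final random draw.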

Named Models and Model Families

Claude
Anthropic’s family of closed-source LLMs. (Modules 2, 8)
DeepSeek
A Chinese AI company and model family; DeepSeek MoE is a widely used open MoE variant. (Module 5)
FLUX / FLUX.1
State-of-the-art open image generation model from Black Forest Labs, using the MMDiT architecture. (Module 4)
Gemini / Gemini Flash
Google’s closed-source multimodal model family. (Modules 2, 4)
Gemma / Gemma 3
Google’s open-weight model family. (Modules 4, 6)
GPT-2
OpenAI’s 2019 model (1.5B parameters), trained on WebText. Demonstrated strong zero-shot performance and was initially withheld due to safety concerns. (Modules 1, 2)
GPT-3
OpenAI’s 2020 model (175B parameters) with strong few-shot learning, accessed via API. (Module 2)
GPT-3.5 / ChatGPT
OpenAI’s instruction-tuned model launched November 2022. Reached 1 million users in 5 days. (Module 2)
InstructGPT
GPT-3 fine-tuned with RLHF to follow instructions; the key innovation that led to ChatGPT. (Module 2)
Llama / LLaMA
Meta’s open-weight model family; Llama 1 (2023, 7B–65B parameters), with Llama 2 the first version released for commercial use. (Module 2)
Midjourney
A closed-source image generation model known for high artistic quality. (Module 4)
Mistral / Mixtral
Mistral AI’s model family; Mixtral 8×7B is a popular open-source MoE model. (Module 5)
Nemotron
NVIDIA’s open-source model family, including MoE variants. (Modules 2, 5)
o1 / o3
OpenAI’s reasoning/thinking models that use hidden “thinking tokens” before producing a visible answer. (Module 6)
OLMo
A fully open-source model from AI2 (Allen Institute for AI) where both the weights and training data are publicly available. (Module 2)
Phi
Microsoft’s family of Small Language Models (SLMs), including MoE variants. (Module 5)
Qwen / Qwen2.5
Alibaba’s open-weight model family, available in various sizes. (Modules 2, 5, 6, 7)
Stable Diffusion (SD 1.5, SDXL, SD3)
Stability AI’s open-source text-to-image diffusion model, with multiple versions improving resolution and quality. (Module 4)
Switch Transformer
Google’s 2022 MoE model that simplified routing to a single expert per token. (Module 5)

APIs, Protocols, and Specifications

Chat Template
A structured format for distinguishing speakers in a conversation (system, user, assistant). Different model families use different formats (e.g., ChatML, Llama’s template). (Module 2)
ChatML
A chat template format using <|im_start|> and <|im_end|> tokens to delimit speaker turns. Used by GPT-3.5 and others. (Module 2)
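A minimal renderer for the format might look like the sketch below (the `to_chatml` function is invented for illustration; in practice a model's tokenizer applies its template for you, e.g. via Hugging Face's `apply_chat_template`):

```python
def to_chatml(messages):
    # Render OpenAI-style {"role", "content"} messages in the ChatML format.
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # open an assistant turn to cue a reply
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

The trailing unclosed assistant turn is the cue: the model completes that turn, and generation stops when it emits `<|im_end|>`.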
Function Calling / Tool Calling
An OpenAI API feature (June 2023) that allows models fine-tuned for tool use to return structured JSON specifying which function to call and with what arguments. (Module 3)
JSON Mode
An earlier OpenAI API feature (November 2023) guaranteeing that output is valid JSON, but without enforcing a specific schema. Superseded by Structured Outputs. (Module 2)
JSON-RPC 2.0
The underlying remote procedure call protocol used by MCP servers. (Module 3)
MCP (Model Context Protocol)
A standard interface for AI tools released by Anthropic in November 2024. Functions like a USB standard for AI peripherals — implementations are called “MCP servers.” Uses JSON-RPC 2.0. (Module 3)
OpenAI Chat Completions API
The dominant LLM API format, using a /chat/completions endpoint. Adopted by many providers as a de facto standard. (Module 2)
OpenAI Responses API
A newer OpenAI API that replaced the Assistants API, introduced alongside the OpenAI Agents SDK. (Module 3)
SSE (Server-Sent Events)
A unidirectional HTTP protocol used to stream tokens from a server to a client as they are generated, enabling the “typewriter effect” in chat interfaces. (Module 2)
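On the wire, each SSE event is a line beginning with `data:`, with OpenAI-style streams ending in a `[DONE]` sentinel. A minimal parser for the data lines (the `iter_sse_data` helper is invented for illustration; real clients also handle `event`, `id`, and `retry` fields and multi-line data):

```python
def iter_sse_data(lines):
    # Yield the payload of each "data:" line until the end-of-stream sentinel.
    for line in lines:
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # OpenAI's end-of-stream marker
                return
            yield payload

stream = ["data: Hel", "data: lo!", "", "data: [DONE]"]
print("".join(iter_sse_data(stream)))  # "Hello!"
```

In a real chat API each payload is a JSON chunk containing a token delta; concatenating the deltas as they arrive produces the typewriter effect.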
Structured Outputs
An API feature (OpenAI, August 2024) that guarantees model output matches a specified JSON schema exactly, using constrained decoding. (Module 2)
System Prompt
The first message in a conversation that sets the model’s role, behavior, and constraints. (Modules 2, 6)
Token Streaming
Delivering model output tokens to the client incrementally as they are generated, rather than waiting for the complete response. (Module 2)

Agents and Multi-Agent Systems

Agent Router
An agent design pattern that receives a request and hands it off to the appropriate specialized sub-agent. (Module 3)
AI Agent
An AI system that is goal-driven, autonomous, reactive, persistent, and capable of interacting with external systems and other agents. (Module 3)
AutoGen
Microsoft’s multi-agent framework, available in Python with .NET support forthcoming. (Module 3)
Code Interpreter
An agent tool that enables an AI to write and execute code on the fly within a sandboxed environment. (Module 3)
Computer Use
An agent capability allowing an AI to interact with a computer’s graphical user interface. (Module 3)
Crew.ai
A popular commercial Python framework for building multi-agent AI systems. (Module 3)
Guardrails
Safety constraints applied to agent inputs and outputs to prevent undesirable behavior. (Modules 3, 8)
Handoff
In a multi-agent system, the transfer of control from one agent to another for a specific task. (Module 3)
Human-in-the-Loop
A design pattern requiring human approval or review before an agent takes certain actions, particularly irreversible or high-stakes ones. (Modules 3, 8)
LangChain
An early and influential Python framework for building LLM applications; the basis for LangGraph. (Module 3)
LangGraph
A Python agent framework built on LangChain; one of the first frameworks supporting stateful, graph-based agent workflows. (Module 3)
Long-term Memory (Agent)
Persistent agent memory that survives beyond a single conversation. Types include factual, episodic, and procedural memory. (Module 3)
mem0
An open-source library for implementing long-term agent memory. (Module 3)
Microsoft Semantic Kernel
Microsoft’s agent SDK supporting Python, .NET, and Java. (Module 3)
OpenAI Agents SDK
A framework announced March 2025 for building multi-agent systems in Python and TypeScript. Supports function calling, handoffs, tracing, and session management. (Module 3)
Orchestrator
An agent design pattern that uses other agents as tools, delegating subtasks and aggregating results. (Module 3)
Parallel Agents
An agent design pattern that calls multiple agents simultaneously and aggregates their results. (Module 3)
Session
The OpenAI Agents SDK’s mechanism for maintaining short-term memory (conversation history) across agent calls. (Module 3)
Short-term Memory (Agent)
Stores and retrieves the current conversation thread; typically implemented as a session in agent SDKs. (Module 3)
Tracing
Built-in recording of agent generations, tool calls, handoffs, and other events for debugging and auditing purposes. (Module 3)

RAG and Context Techniques

Chunking
Splitting large documents into smaller pieces before embedding for use in RAG. Strategies range from fixed-size splits to sentence-aware and semantic chunking. (Module 6)
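The simplest strategy, fixed-size chunks with overlap, can be sketched in a few lines (the `chunk_text` function and its default sizes are invented for illustration; production systems usually chunk by tokens or sentences rather than characters):

```python
def chunk_text(text, size=200, overlap=50):
    # Fixed-size character chunks with overlap, so a sentence cut at one
    # boundary still appears whole in a neighboring chunk.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(500))  # stand-in for a document
chunks = chunk_text(text, size=200, overlap=50)  # 3 chunks of 200 characters
```

Each chunk is then embedded and stored in the vector database; the overlap trades a little storage for better retrieval of content near chunk boundaries.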
Context Injection
Taking retrieved information and inserting it into the model’s system prompt before making an API call. The “generation” step in RAG. (Module 6)
FAISS
Meta’s fast in-memory vector index library, widely used for similarity search in RAG systems. (Module 6)
Milvus
An open-source vector database capable of handling billions of embeddings at scale. (Module 6)
pgvector
A PostgreSQL extension for storing and querying vector embeddings directly in a Postgres database. (Module 6)
Pinecone
A popular managed vector database offered as a cloud service. (Module 6)
Qdrant
An open-source dedicated vector database written in Rust. (Module 6)
RAG (Retrieval-Augmented Generation)
A technique to reduce hallucinations by retrieving relevant external documents and injecting them into the model’s context before generating a response. Term coined in 2020. (Modules 3, 6, 8)
Semantic Chunking
A high-quality chunking strategy that groups sentences by embedding similarity and splits the text where the meaning changes significantly. (Module 6)
sqlite-vec
A SQLite extension that adds vector embedding storage and search capabilities. (Module 6)
Text-to-SQL
A technique where the model converts a natural language question into a SQL query to retrieve structured data. A form of context injection. (Module 6)
Vector Store / Vector Database
A database that stores vector embeddings and enables efficient similarity search. The retrieval component in a RAG pipeline. (Modules 3, 6)

Quantization and Model Formats

bf16 (bfloat16)
A 16-bit floating-point format (“brain float”) that keeps FP32’s exponent range at reduced precision; used in training and for LoRA adapter matrices in QLoRA. (Module 7)
FP16 / FP32
16-bit and 32-bit floating-point formats. Higher precision, higher memory usage. (Modules 5, 7)
GGML (Georgi Gerganov Machine Learning)
A C/C++ library and custom binary format for CPU-based LLM inference that helped democratize local model access. Superseded by GGUF. (Module 5)
GGUF (GPT-Generated Unified Format)
The replacement for GGML, adding extensibility, better metadata, single-file architecture, and support for offloading selected layers to GPU or NPU. The standard format for llama.cpp-based inference. (Module 5)
GPTQ (GPT Quantization)
One of the first widely adopted methods for aggressive 4-bit post-training quantization. CUDA-only; distributed via Hugging Face. (Module 5)
INT8
An 8-bit integer quantization format used to reduce model memory footprint. (Module 5)
K-Quant Strategy
A mixed quantization strategy in GGUF where different model layers are quantized at different bit depths based on their sensitivity. Common variants include Q4_K_M and Q5_K_S. (Module 5)
NF4 Format
4-bit NormalFloat format used to store base model weights in QLoRA, reducing memory requirements during fine-tuning. (Module 7)
ONNX (Open Neural Network eXchange)
A model interchange format created by Microsoft and Facebook in 2017 for portability between ML frameworks. Uses protobuf serialization. (Module 5)
Quantization
The process of reducing the numerical precision of model weights (e.g., from 16-bit floats to 4-bit integers) to reduce memory usage and speed up inference, with a modest accuracy tradeoff. (Modules 5, 7)
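A minimal sketch of symmetric linear quantization to 8-bit integers (function names invented for illustration; real schemes like NF4 or K-quants use per-block scales and non-uniform levels): pick a scale so the largest weight maps to 127, round every weight to an integer, and multiply back by the scale to reconstruct.

```python
def quantize_int8(weights):
    # Symmetric linear quantization: map floats onto integers in [-127, 127].
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.31, -1.27, 0.05, 0.88]
q, scale = quantize_int8(w)        # stored as small integers + one float scale
w_hat = dequantize(q, scale)       # approximate reconstruction at inference
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The integers take a quarter of the memory of FP32 weights, and the worst-case rounding error is half a quantization step (scale / 2), which is the "modest accuracy tradeoff" in practice.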
Safetensors
A tensor storage format used by Hugging Face and Apple MLX; designed to be safe and fast to load. (Module 5)

Hardware and Compute

ANE (Apple Neural Engine)
Apple’s on-device NPU for accelerating CoreML workloads on iPhone, iPad, and Apple Silicon Macs. (Module 5)
CUDA (Compute Unified Device Architecture)
NVIDIA’s GPU programming platform, launched in 2006. The de facto standard for deep learning, including libraries like cuBLAS and cuDNN. (Module 5)
DGX Spark
An NVIDIA desktop workstation with a GB10 chip and 128GB of unified memory, launched in 2025. (Module 5)
GPU (Graphics Processing Unit)
A massively parallel processor essential for training and inference of neural networks. (Modules 0, 1, 5)
GPGPU (General Purpose GPU)
Using GPU hardware for non-graphics computational workloads such as machine learning. Enabled by CUDA. (Module 5)
Metal / MPS (Metal Performance Shaders)
Apple’s low-level GPU API. MPS added optimized primitives for neural network operations in 2017. (Module 5)
MLX
Apple’s open-source ML framework (released December 2023) designed for Apple Silicon. Provides a NumPy/PyTorch-like Python API using the Metal GPU backend. (Module 5)
NPU (Neural Processing Unit)
A specialized processor optimized for neural network operations at lower power consumption than a GPU. Common in smartphones and edge devices. (Module 5)
NVLink
NVIDIA’s high-bandwidth interconnect used to connect multiple GPUs in a server or workstation. (Module 5)
ROCm (Radeon Open Compute)
AMD’s open-source alternative to CUDA, including rocBLAS. Currently Linux-only. (Module 5)
SIMD (Single Instruction Multiple Data)
A CPU instruction set feature for performing the same operation on multiple data elements simultaneously. Used by llama.cpp for CPU inference optimization. (Module 5)
SoC (System on a Chip)
An integrated circuit combining CPU, GPU, and other components on a single chip. Apple Silicon is a prominent example. (Module 5)
TFLOPS (Tera Floating-Point Operations Per Second)
A measure of a processor’s compute performance. 1 TFLOPS = 1 trillion floating-point operations per second, typically quoted at FP32 precision. (Module 5)
TOPS (Tera Operations Per Second)
A measure of processor performance for integer or mixed-precision operations. Common for comparing NPUs. (Module 5)
TPU (Tensor Processing Unit)
Google’s custom AI accelerator, available for free use in Google Colab. (Modules 1, 5)
Unified Memory
A memory architecture shared between the CPU and GPU on the same chip (e.g., Apple Silicon, NVIDIA DGX Spark). Enables larger models than discrete VRAM but at lower bandwidth. (Module 5)
VRAM (Video RAM)
The dedicated memory on a discrete GPU. A key constraint for running large models — the model must generally fit within available VRAM. (Module 5)
WebAssembly (WASM)
A portable binary instruction format enabling near-native performance in web browsers; used to run small ML models client-side. (Module 5)
WebGPU
A web standard for GPU-accelerated computation in the browser. Used by WebLLM and Transformers.js for in-browser LLM inference. (Module 5)

Inference Frameworks and Tools

LiteLLM
An open-source tool providing a unified OpenAI-compatible API interface across multiple LLM providers. (Module 2)
llama.cpp
A C/C++ library for CPU and GPU inference of GGUF models. Includes a CLI, web UI, and OpenAI-compatible API server. Released March 2023 by Georgi Gerganov. (Module 5)
llama-cpp-python
A Python binding for llama.cpp with an OpenAI-compatible API. (Module 5)
LLamaSharp
A C# binding for llama.cpp, installed via NuGet. Supports CPU, CUDA, and Vulkan backends. (Module 5)
LM Studio
A desktop GUI application that wraps llama.cpp, providing a model browser, built-in chat interface, and a local API server. (Modules 5, 7)
Ollama
A simple CLI tool wrapping llama.cpp, using a Modelfile for configuration. Provides a curated model library. (Module 5)
vLLM
A high-performance, OpenAI-compatible LLM inference server optimized for production deployments. (Module 2)
WebLLM
A JavaScript library for in-browser LLM inference using WebGPU. Requires models in MLC format. (Module 5)
Wllama
A JavaScript library for in-browser CPU-only inference using GGUF models. (Module 5)
Transformers.js
Hugging Face’s JavaScript equivalent of the transformers library. Uses ONNX format and runs models directly in the browser. (Module 5)

Prompt Engineering

Chain-of-Thought (CoT)
A prompt engineering technique that asks the model to “think step by step” before answering. Shown to dramatically improve reasoning and math performance (Google, 2022). (Modules 6, 8)
Few-shot Learning / Examples
Providing 2–5 input/output examples in the prompt to guide the model toward a desired format or behavior. (Modules 2, 6)
Negative Samples (Prompting)
Including examples of what the model should not do alongside positive examples in the prompt. (Module 6)
Prompt Engineering
The practice of carefully crafting model inputs to guide outputs. Techniques include few-shot examples, chain-of-thought prompting, role assignment, and negative samples. (Module 6)
Role / Persona Assignment
Adding a role or persona to the system prompt to guide the model’s tone, style, and perspective. (Module 6)
Zero-shot
The model’s ability to perform a task with no examples provided in the prompt, relying entirely on knowledge from pretraining. (Module 1)

Reasoning Models

Reasoning / Thinking Models
Models fine-tuned to produce a “thinking” phase before their final answer, giving them a scratch space for exploration and self-correction. (Module 6)
Thinking Tokens
Tokens the model uses to reason before producing its visible answer. OpenAI’s o1/o3 use hidden thinking tokens; many open-weight models use visible <think>/</think> delimiters. (Module 6)

Evaluation

Dataset Contamination
When benchmark questions appear in a model’s training data, inflating benchmark scores through memorization rather than genuine capability. (Module 6)
Evals (Evaluations)
Benchmarks and test suites used to measure model capabilities, track progress over time, and detect regressions. (Module 6)
GPQA (Graduate-Level Google-Proof Q&A)
A PhD-level scientific reasoning benchmark in biology, physics, and chemistry. Designed to be unsolvable via web search alone. (Module 6)
HLE (Humanity’s Last Exam)
A benchmark of 2,500 expert-level questions requiring multimodal, multi-step reasoning. Created by CAIS and Scale AI. (Module 6)
LLM as a Judge
Using a separate LLM to evaluate the outputs of another LLM for quality, safety, or accuracy. (Module 8)
MMLU (Massive Multitask Language Understanding)
A multi-domain multiple-choice benchmark covering STEM, humanities, and more. Published in 2021; now largely saturated by top models. (Module 6)
MMLU-Pro
A harder version of MMLU with 12,000 questions across 14 subjects. Released June 2024 to address model saturation of the original MMLU. (Module 6)
Red-Teaming
Adversarial testing of models to systematically find failure modes, biases, and safety vulnerabilities. (Modules 6, 8)
SOTA (State of the Art)
The best-performing result on a given benchmark or task at a given point in time. (Module 1)
SWE-Bench
A benchmark testing AI ability to resolve real GitHub issues from open-source Python repositories. Created in 2023 by the Princeton NLP group. (Module 6)
W&B / Weights & Biases
An ML experiment tracking platform for monitoring training metrics (loss, accuracy, GPU utilization) across runs. (Module 7)

Multimodal and Image Generation Concepts

AnimateDiff
A video generation extension of the diffusion model framework. (Module 4)
ControlNet
An architecture (Stanford, February 2023) that adds spatial control to diffusion models by creating a trainable copy of the U-Net encoder to accept conditioning inputs (depth maps, pose skeletons, edge maps) while keeping original model weights frozen. (Module 4)
Depth Map
An image where pixel values represent distance from the viewer. Used as a spatial control input for image generation. (Module 4)
Denoising / Reverse Diffusion
The inference phase of a diffusion model: starting from pure random noise and iteratively removing noise, guided by a text prompt, to produce a coherent image. (Module 4)
Forward Diffusion Process
The training phase of a diffusion model: progressively adding random noise to real images. The model learns to predict what noise was added at each step. (Module 4)
Image-to-Image
A model capability that generates a modified image from an existing image and a text prompt, using partial denoising of the source image. (Module 4)
Inpainting
Filling in missing or masked regions of an image in a realistic way, steered by the surrounding context and a text prompt. (Module 4)
OpenPose
A human pose estimation model whose output skeleton can be used as a conditioning input for ControlNet. (Module 4)
Outpainting
Extending an image beyond its original borders by treating the new region as a masked area and applying inpainting. (Module 4)
Prompt Upsampling
An image generation feature that runs a short prompt through an LLM to make it more detailed and descriptive before passing it to the image model. (Module 4)
Safety Classifier / Safety Tolerance
Separate classifier models that run alongside generative image models to filter harmful prompts (input filtering) or flag unsafe generated images (output filtering). (Module 4)
Super Resolution
An image-to-image task of increasing the resolution and detail of an existing image. (Module 4)
Style Transfer
An image-to-image task of recreating an image in a different artistic style. (Module 4)
Text-to-Image
A model capability that generates an image from a natural language text prompt using a diffusion process. (Module 4)

Tools, Platforms, and Services

Google Colab
A cloud-based Jupyter notebook environment with free GPU and TPU access. (Module 1)
Gradio
A Python library for rapidly building web UIs for ML demos. Supports text, images, audio, and streaming. Acquired by Hugging Face in 2021. (Modules 2, 3)
Hugging Face
The central platform for sharing AI models, datasets, and demos. Often described as “GitHub for AI models.” (Modules 2–7)
Hugging Face Datasets
Hugging Face’s repository of public training datasets, stored in Parquet format. Used for uploading fine-tuning data. (Module 7)
Hugging Face Spaces
Free cloud hosting for ML demos, supporting Gradio, Streamlit, and Docker. (Module 3)
Hugging Face Transformers Library
An open-source Python library providing unified access to thousands of pre-trained transformer models across PyTorch, TensorFlow, and JAX. (Module 2)
HF Pipelines
A high-level abstraction in the Hugging Face Transformers library that simplifies model usage with a standardized API across task types. (Module 4)
JAX
Google’s numerical computing library, used as an alternative backend for Hugging Face Transformers. (Module 2)
OpenRouter
An inference provider offering a unified OpenAI-compatible API to hundreds of models from OpenAI, Anthropic, Google, Meta, and others. Pay-per-call pricing. (Module 2)
PyTorch
The dominant deep learning framework for research and production, used by most Hugging Face models. (Modules 2, 5)
Replicate
A model hosting platform focused on image and video models. Offers pay-per-call pricing and supports fine-tuning. (Module 4)
TensorFlow
Google’s deep learning framework; an alternative to PyTorch. (Modules 2, 5)
tiktoken
OpenAI’s tokenization library for estimating token counts. (Module 2)

Ethics, Safety, and Intellectual Property

Adversarial Optimization (Jailbreak)
A jailbreaking technique that uses gradient-based optimization to find token sequences that reliably bypass a model’s safety guardrails. (Module 8)
Bias and Fairness
The reflection of societal biases present in training data into model outputs. A mathematical inevitability when training data reflects historical inequities. (Module 8)
C2PA (Coalition for Content Provenance and Authenticity)
An organization developing open standards for cryptographically signing media at the point of creation to verify its authenticity and origin. (Module 8)
Capability Bounding
Limiting a model’s scope and capabilities via fine-tuning, alignment, or system prompting to prevent unintended behaviors. (Module 8)
Confidence Calibration
Training models to express appropriate uncertainty rather than stating everything with equal, unwarranted conviction. (Module 8)
Copyright / Fair Use
The ongoing legal question of whether training AI models on copyrighted works constitutes fair use or infringement. (Module 8)
DAN (Do Anything Now)
An early ChatGPT jailbreak prompt that triggered an alter-ego mode, bypassing safety restrictions. (Module 8)
Deepfake
AI-generated synthetic media — text, image, video, or audio — used to impersonate a real person or deceive an audience. (Module 8)
EU AI Act
European regulation governing AI systems, including provisions for high-risk domain oversight, environmental documentation, and mandatory labeling of deepfakes. (Module 8)
Explainability / Black Box
The difficulty of understanding or auditing a neural network’s internal reasoning. Raises ethical concerns in high-stakes domains like healthcare, law, and defense. (Module 8)
Fiction Framing Attack
A jailbreaking technique that wraps a harmful request inside a fictional storytelling context to bypass safety guardrails. (Module 8)
Hallucination
When a model generates plausible-sounding but factually incorrect information. Not a traditional software bug but a consequence of stochastic next-token prediction. (Modules 1, 6, 7)
Input / Output Filtering
Safety classifier layers that analyze prompts before they reach the model and/or screen model responses before they are returned to the user. (Modules 4, 8)
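A production safety classifier is a trained model, not a keyword list, but a toy input filter (the blocklist patterns here are hypothetical) shows where input filtering sits in the request flow:

```python
import re

# Hypothetical patterns for illustration only; real systems use trained
# classifiers that score the whole prompt, not regex matching.
BLOCKLIST = [r"\bmake a bomb\b", r"\bsteal credit card numbers\b"]

def filter_prompt(prompt: str):
    """Return (allowed, payload): block before the prompt reaches the model."""
    for pattern in BLOCKLIST:
        if re.search(pattern, prompt, re.IGNORECASE):
            return False, "Prompt blocked by input filter."
    return True, prompt

allowed, payload = filter_prompt("Write a haiku about spring")
# allowed is True, so the prompt would pass through to the model
```

Output filtering works symmetrically: the model's response is scored by a classifier before being returned to the user.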
Jailbreaking
Attempts to bypass a model’s safety guardrails and system prompt constraints to produce disallowed content. (Module 8)
Model Card
A documentation file (typically README.md) accompanying a Hugging Face model, describing its training process, benchmark results, intended uses, limitations, and risks. (Module 7)
Open Weights
A model distribution where the trained weight files are publicly downloadable, but the training data and full training code are not. Enables local deployment and fine-tuning. (Module 2)
Prompt Injection
An attack where malicious content in the model’s input attempts to override the system prompt or hijack the model’s instructions. (Module 8)
Vibe Hacking
Anthropic’s term for AI-assisted automation of large portions of a cybercrime campaign, lowering the barrier for sophisticated attacks. (Module 8)
Voice Cloning
An AI technique that replicates a person’s voice from a small audio sample, raising significant impersonation and consent concerns. (Module 8)

Data and Training Infrastructure

AdamW
A variant of the Adam optimizer with decoupled weight decay, commonly used for training and fine-tuning LLMs. (Module 7)
Adapter
The small set of trained LoRA matrices (A and B) that encode a behavioral change. Can be kept as a separate file or merged into the base model weights. (Module 7)
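The merge step can be sketched in plain Python on toy 2x2 matrices (the `alpha` and `r` values and matrix sizes are illustrative; real adapters hold tensors for many layers):

```python
def matmul(X, Y):
    """Plain-list matrix multiply, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge_lora(W, A, B, alpha=16, r=2):
    """Merge a LoRA adapter into base weights: W' = W + (alpha / r) * (B @ A)."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> full-size update
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weights (2x2)
A = [[0.1, 0.0], [0.0, 0.1]]   # low-rank factor, r x d_in
B = [[0.0, 0.0], [0.0, 0.0]]   # d_out x r; B is initialized to zero in LoRA
merged = merge_lora(W, A, B)   # B is all zeros, so merged equals W
```

Because B starts at zero, a freshly initialized adapter changes nothing; training moves A and B so that B @ A encodes the behavioral change.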
Checkpoint
A saved snapshot of model weights at a point during training, allowing training to be resumed or a specific point in training to be evaluated. (Module 7)
JSONL (JSON Lines)
A file format where each line is a valid JSON object. The standard format for fine-tuning datasets. (Module 6)
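A minimal sketch using only the standard library, assuming chat-style records in the common OpenAI-style `messages` layout (the filename and example contents are illustrative):

```python
import json

# Each line is one self-contained JSON object -- one training example per line.
examples = [
    {"messages": [{"role": "user", "content": "What is 2+2?"},
                  {"role": "assistant", "content": "4"}]},
    {"messages": [{"role": "user", "content": "Capital of France?"},
                  {"role": "assistant", "content": "Paris"}]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Reading back: parse one line at a time; no need to load the whole file.
with open("train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

Line-at-a-time parsing is what makes JSONL practical for large datasets: tools can stream, shuffle, and split examples without holding everything in memory.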
Model Weights
The numerical parameters of a trained model — the “knowledge” encoded during the training process. (Modules 2, 7)
Parameters
The individual numerical values in a model’s weight matrices. Model size is commonly expressed in billions of parameters (e.g., 7B, 70B). (Modules 1, 2, 5)
Parquet
A columnar data storage format used by Hugging Face Datasets. (Module 7)
Test Set
A held-out portion of data (~10–15%) used only after training is complete to provide an unbiased final performance measure. (Module 6)
Training Set
The largest portion of data (~70–80%) that the model directly learns from during fine-tuning. (Module 6)
Validation Set
A held-out portion of data (~10–15%) used during training to monitor generalization and detect overfitting. (Module 6)
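A toy 70/15/15 split with the standard library (the proportions follow the ranges above; the seed and function name are illustrative):

```python
import random

def split_dataset(examples, train=0.70, val=0.15, seed=42):
    """Shuffle, then carve off training, validation, and test portions."""
    data = list(examples)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (data[:n_train],                  # training set: model learns from this
            data[n_train:n_train + n_val],   # validation set: monitored during training
            data[n_train + n_val:])          # test set: touched only once, at the end

train_set, val_set, test_set = split_dataset(range(100))
# 70 / 15 / 15 examples; no example appears in more than one split
```

Keeping the three splits disjoint is the whole point: any overlap lets the model "see" its exam questions during training, inflating the measured performance.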
WebText
The dataset of 8 million web pages (~40GB of text) used to train GPT-2. (Module 1)