Module 6: Increasing Model Accuracy (Part 1)

Recap

Understood the use cases, advantages, and disadvantages of running models on local hardware (desktop, web, mobile)
Understood hardware requirements and architectures for model inference - e.g., CUDA vs. ONNX vs. MLX vs. WebGPU
Explored how quantization works and understood techniques and formats for quantizing existing models
Used llama.cpp to quantize and run an SLM on local hardware/gaming PC
Integrated a quantized model within Unity/Unreal/WebAssembly

Lesson Objectives

Understand what causes hallucinations in models, how models are evaluated, and which techniques can increase accuracy
Explore prompt engineering and thinking models
Introduce and implement Text-to-SQL and RAG (Retrieval-Augmented Generation) to increase the accuracy of a limited SLM
Start exploring model fine-tuning
Generate synthetic data for fine-tuning a small language model

Hallucinations

Breaking News: Models hallucinate!
- Especially noticeable on older and smaller models
- Many hallucinations can be difficult to spot as they sound so plausible
- Plausible-sounding requests (e.g., asking about something that doesn’t exist) often result in a hallucination

Hallucinations

MODEL = "meta-llama/llama-3.2-1b-instruct"

import os
import openai

client = openai.OpenAI(
    base_url='https://openrouter.ai/api/v1',
    api_key=os.environ["OPENROUTER_API_KEY"],
)

Hallucinations: Citations

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You help students generate citations for their coursework."},
        {"role": "user", "content": "Can you cite three academic papers on transformer efficiency. I'm looking for the author names, journals, years, and titles."},
    ],
)
response.choices[0].message.content

Hallucinations: Citations

'Here are three academic papers on transformer efficiency:\n\n1. **"Attention is a Limited Resource: The Case for Unique Attention Mechanisms in Deep Neural Networks"**\n\nAuthor: Vaswani et al.\n\nJournal: Nature\n\nYear: 2017\n\nTitle: Attention is a Limited Resource: The Case for Unique Attention Mechanisms in Deep Neural Networks\n\nThis paper proposes a new approach to attention mechanisms in deep neural networks, which suggests that attention is not just a simple linear scaling of the input features, but rather a complex mechanism that involves multiple stages and attention weights. The authors demonstrate that this approach can improve the performance of deep neural networks on various tasks.\n\n2. **"Attention is All You Need"**\n\nAuthor: Vaswani et al.\n\nJournal: Nature\n\nYear: 2017\n\nTitle: Attention is All You Need\n\nThis paper presents a new approach to attention mechanisms in deep neural networks, which suggests that attention is not just a simple linear scaling of the input features, but rather a complex mechanism that involves multiple stages and attention weights. The authors demonstrate that this approach can improve the performance of deep neural networks on various tasks.\n\n3. **"Attention is a Key to Efficient Neural Network Training"**\n\nAuthor: He et al.\n\nJournal: Proceedings of the 32nd International Conference on Machine Learning (ICML)\n\nYear: 2015\n\nTitle: Attention is a Key to Efficient Neural Network Training\n\nThis paper presents a new approach to attention mechanisms in deep neural networks, which suggests that attention is a key factor in improving the efficiency of neural network training. The authors demonstrate that attention mechanisms can significantly improve the training speed and accuracy of deep neural networks on various tasks.\n\nNote: These papers are just a few examples of the many papers that have explored transformer efficiency.'

Hallucinations: Local Knowledge

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You help provide information about software applications."},
        {"role": "user", "content": "Tell me more about FilterGenie, software for the Mac that allows you to apply dynamic filters to images."},
    ],
)
response.choices[0].message.content

Hallucinations: Local Knowledge

"FilterGenie is a powerful image editing software for Mac that allows you to apply dynamic filters to images. It's a popular tool among photographers, graphic designers, and anyone looking to enhance and transform their images with ease.\n\nHere are some key features and benefits of FilterGenie:\n\n**Key Features:**\n\n1. **Dynamic Filters**: FilterGenie offers a wide range of dynamic filters that can be applied to images, including:\n\t* Color correction and adjustment\n\t* Noise reduction and sharpening\n\t* Texture and pattern overlays\n\t* Effects like blur, vignette, and glow\n\t* And many more!\n2. **Adjustable Parameters**: Each filter has adjustable parameters, allowing you to fine-tune the effect to your liking.\n3. **Support for Multiple Image Files**: FilterGenie supports a wide range of image file formats, including JPEG, PNG, and TIFF.\n4. **Batch Processing**: You can apply filters to multiple images at once, making it a great tool for editing multiple images in a single session.\n5. **Export Options**: FilterGenie allows you to export your edited images in various formats, including JPEG, PNG, and PSD.\n\n**Benefits:**\n\n1. **Easy to Use**: FilterGenie's intuitive interface makes it easy to apply filters to images, even for those without extensive image editing experience.\n2. **Versatile**: The software offers a wide range of filters to suit various image editing needs, from basic adjustments to more advanced effects.\n3. **High-Quality Results**: FilterGenie's filters are designed to produce high-quality results, ensuring your images look professional and polished.\n4. 
**Cross-Platform Compatibility**: FilterGenie is available for Mac, Windows, and Linux, making it a great option for users with multiple operating systems.\n\n**System Requirements:**\n\nTo run FilterGenie, you'll need:\n\n* Mac with macOS 10.12 or later\n* 2 GB RAM or more\n* 2 GB free disk space\n* A compatible graphics card (Intel Core i5 or later recommended)\n\n**Pricing:**\n\nFilterGenie offers a free trial version, as well as a one-time purchase option for the full version. The full version costs around $49.99 USD.\n\nOverall, FilterGenie is a powerful and versatile image editing software that's perfect for anyone looking to enhance and transform their images with dynamic filters."

Hallucinations: Summarization

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You help users summarize documents and published papers."},
        {"role": "user", "content": "Can you summarize the key findings of the 2019 WHO report on urban air quality in South Asia."},
    ],
)
response.choices[0].message.content

Hallucinations: Summarization

"I couldn't find a specific 2019 WHO report on urban air quality in South Asia. However, I can provide some general information on the topic and highlight some key findings from previous reports.\n\nThe World Health Organization (WHO) has published several reports on air quality in South Asia, including:\n\n1. **2018 WHO Air Quality Guidelines**: In 2018, WHO published guidelines for air quality in urban areas, which emphasized the importance of reducing air pollution from domestic and industrial sources. The guidelines recommended that cities implement measures to reduce particulate matter (PM2.5) and other pollutants.\n2. **2019 WHO Air Quality Report**: In 2019, WHO published a report on air quality in South Asia, which highlighted the region's vulnerability to air pollution. The report noted that South Asia is home to some of the world's most populous cities, including Delhi, Mumbai, and Dhaka, which are among the most polluted cities in the world.\n3. **2019 WHO Air Quality Report: South Asia**: This report focused on the health impacts of air pollution in South Asia, including the estimated number of premature deaths and hospitalizations due to air pollution. 
The report also highlighted the need for cities to implement effective air quality management strategies to reduce pollution.\n\nSome key findings from these reports include:\n\n* **Air pollution is a major public health concern in South Asia**: The region is home to some of the world's most polluted cities, with PM2.5 levels exceeding the WHO's recommended limits in many areas.\n* **Air pollution affects vulnerable populations**: The report highlighted the disproportionate impact of air pollution on vulnerable populations, including children, the elderly, and people with pre-existing respiratory conditions.\n* **Air pollution is linked to increased mortality**: The report estimated that air pollution is responsible for an estimated 1.8 million premature deaths per year in South Asia, primarily due to respiratory diseases such as bronchitis and lung cancer.\n* **Air pollution affects economic productivity**: The report noted that air pollution can have significant economic impacts, including reduced economic productivity and increased healthcare costs.\n\nSome specific data points from the 2019 WHO Air Quality Report include:\n\n* Delhi, India: PM2.5 levels exceeded the WHO's recommended limit in 2019, with an average annual concentration of 44.4 μg/m3.\n* Mumbai, India: PM2.5 levels exceeded the WHO's recommended limit in 2019, with an average annual concentration of 35.4 μg/m3.\n* Dhaka, Bangladesh: PM2.5 levels exceeded the WHO's recommended limit in 2019, with an average annual concentration of 23.4 μg/m3.\n\nOverall, the 2019 WHO Air Quality Report highlights the urgent need for cities in South Asia to take action to reduce air pollution and protect public health."

Why Do Models Hallucinate?

A language model is not a database
- It’s a stochastic prediction machine
- Models often don’t know how to say “I don’t know”
- Instead, they are designed to come up with the most plausible continuation, not to retrieve verified facts
- (Hallucination isn’t a bug in the traditional sense)
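
The "most plausible continuation" idea can be sketched with a toy next-token sampler. The vocabulary and scores below are invented for illustration and do not come from any real model. The point is that softmax always produces a probability distribution, so some token is always emitted, even when no candidate is well supported.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, rng):
    """Pick one token at random, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical candidates for the prompt "FilterGenie is available for ..."
vocab = ["Mac", "Windows", "Linux", "iOS"]
logits = [2.1, 1.9, 1.2, 0.3]  # invented scores, nearly tied at the top

rng = random.Random(0)  # fixed seed so the run is repeatable
probs = softmax(logits)
token = sample_next_token(vocab, logits, rng)
print(dict(zip(vocab, (round(p, 2) for p in probs))), "->", token)
```

Nothing in this loop checks whether FilterGenie exists; the sampler only ranks continuations by how likely they look.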

Why Do Models Hallucinate?

Models are trained on a large corpus of Internet data
- Data on the Internet is often uneven
- Full of gaps
- Often contradictory

Why Do Models Hallucinate?

Model size vs. Training set size
- Many of the latest models hold billions of parameters
- But they are trained on trillions of tokens
- Thus, they can’t memorize everything and instead recognize patterns to recall information
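
A rough back-of-the-envelope calculation shows why memorizing everything is out of reach. The figures below are round, illustrative numbers (1 billion parameters stored as 16-bit values, 1 trillion training tokens), not the specifications of any particular model.

```python
# Illustrative round numbers, not the specs of a specific model
params = 1_000_000_000               # 1 billion parameters
bits_per_param = 16                  # e.g., 16-bit weights
training_tokens = 1_000_000_000_000  # 1 trillion tokens

model_bits = params * bits_per_param
bits_per_token = model_bits / training_tokens

print(f"Model size: {model_bits / 8 / 1e9:.0f} GB")
print(f"Storage budget per training token: {bits_per_token} bits")
```

At well under one bit per training token, the weights cannot hold a verbatim copy of the corpus; what survives is compressed patterns.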

Model Accuracy

Fortunately, we can improve the accuracy of models using several techniques:
- Prompt Engineering
- Reasoning/Thinking Models
- Context Injection
- Fine-tuning

Prompt Engineering

Best practices (introduced in Module 2)
- Be specific: “You are a Python programming tutor who explains concepts using simple analogies and provides code examples.”
- Define output: “List no more than 3 suggestions. Always show your work step by step.”
- Set boundaries: “If you are asked questions outside coding, politely redirect the student back to the task.”

Prompt Engineering

Output-formatting examples:
- Providing 2-5 examples of desired output can dramatically improve performance
- “Please provide your output as…”
- “Format your answer in the following way…”
- (Of course, structured outputs, if they’re available, can be more reliable)
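
When structured outputs aren't available, one common fallback is to request JSON in the prompt and validate the reply before using it. The sketch below assumes a hypothetical reply format with a `suggestions` list; `FORMAT_INSTRUCTIONS`, `parse_suggestions`, and the sample replies are invented for illustration.

```python
import json

# Hypothetical formatting instruction appended to the system prompt
FORMAT_INSTRUCTIONS = (
    "Format your answer as a JSON object with a single key, "
    '"suggestions", holding a list of no more than 3 strings.'
)

def parse_suggestions(reply):
    """Return the list of suggestions, or None if the reply is off-format."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    if not isinstance(data, dict):
        return None
    items = data.get("suggestions")
    if not isinstance(items, list) or len(items) > 3:
        return None
    if not all(isinstance(item, str) for item in items):
        return None
    return items

good_reply = '{"suggestions": ["Use a list comprehension", "Add type hints"]}'
bad_reply = "Sure! Here are my suggestions: 1. Use a list comprehension"

print(parse_suggestions(good_reply))
print(parse_suggestions(bad_reply))
```

A `None` result is a signal to retry the request or fall back to a default, rather than passing malformed text downstream.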

Prompt Engineering

Few-shot examples:
- Providing 2-5 examples of input/output pairs dramatically improves performance
- Example: “Q: What’s 15% of 80? A: Let me calculate: 80 × 0.15 = 12”
- Especially effective for formatting, tone, or structure-heavy tasks
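
With a chat API, few-shot examples are usually passed as earlier user/assistant turns. A minimal sketch follows; the helper name `build_few_shot_messages` and the second worked example are assumptions added for illustration.

```python
def build_few_shot_messages(system_prompt, examples, question):
    """Build a chat message list with worked examples before the real question."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_question, example_answer in examples:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": question})
    return messages

examples = [
    ("What's 15% of 80?", "Let me calculate: 80 × 0.15 = 12"),
    ("What's 20% of 50?", "Let me calculate: 50 × 0.20 = 10"),
]

messages = build_few_shot_messages(
    "You are a math tutor. Always show your work.",
    examples,
    "What's 30% of 90?",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`, as in the earlier examples.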

Prompt Engineering

Role/persona assignment:
- It is common to add a role/persona to help guide the model
- “You are a senior engineer reviewing code for a junior developer”
- “You are a geography teacher working with high school students”

Prompt Engineering

Negative samples:
- Showing what not to do alongside positive examples
- “Do not add ```python markdown code fences before generating code”
Chain-of-Thought:
- Explicitly asking the model to “think step-by-step”
- Which introduces our next technique: reasoning/thinking models

Reasoning/Thinking Models

In 2022, Google researchers showed that prompting models to reason “step by step”, by including worked examples in the prompt, substantially improved performance on math and reasoning tasks.
- Paper: “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (Wei et al. 2022)
- This prompt engineering technique is known as “Chain-of-Thought”
- i.e., showing your work helped arrive at better answers

Reasoning/Thinking Models

2023: Models began to be trained with reinforcement learning on reasoning processes, not just final answers
2024-25: OpenAI released o1 and o3 models. “Thinking tokens” are generated in a separate, hidden phase before the visible output.
Today: Thousands of models on Hugging Face, often suffixed with -Thinking

Reasoning/Thinking Models

How do reasoning/thinking models work?
- Models are fine-tuned to produce “thinking tokens” before their final answer
- The model learns to use the thinking phase as a “scratch space” where it can explore, self-correct, and reason
- Thinking tokens in open-weight models usually appear between <think> and </think> tokens (or before a single </think> token that separates them from the answer)
- Thinking tokens for closed models (e.g., o1/o3) are typically hidden
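
Because the thinking tokens arrive in the same string as the answer, applications typically split them apart before display. A minimal sketch, assuming the `<think>`/`</think>` convention described above; `split_thinking` is a hypothetical helper, not part of any library.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split raw model output into (thinking, answer)."""
    if close_tag in text:
        thinking, answer = text.split(close_tag, 1)
    elif text.lstrip().startswith(open_tag):
        # Output was cut off mid-thought: everything is thinking, no answer yet
        thinking, answer = text, ""
    else:
        # No thinking markers at all: treat the whole text as the answer
        thinking, answer = "", text
    # Some models emit only the closing tag, so the opening tag is optional
    thinking = thinking.replace(open_tag, "", 1)
    return thinking.strip(), answer.strip()

raw = "Okay, the user asked who I am. Keep it short.\n</think>\n\nI'm Qwen."
thinking, answer = split_thinking(raw)
print("THINKING:", thinking)
print("ANSWER:", answer)
```

One caveat: when a model uses only the closing tag and the output is truncated before it appears, this helper cannot tell unfinished thinking from an answer.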

Reasoning/Thinking Models

from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen3-4B-Thinking-2507")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

Reasoning/Thinking Models

[{'generated_text': [{'role': 'user', 'content': 'Who are you?'},
   {'role': 'assistant',
    'content': "Okay, the user asked, Who are you? I need to provide a clear and concise response. First, I should state my identity as Qwen, the large language model developed by Tongyi Lab. Next, I should mention my capabilities, such as answering questions, generating text, coding, and logical reasoning. It's important to highlight that I can handle multiple languages and is designed to be helpful and friendly. I should avoid technical jargon to keep it accessible. Also, I need to check if there's any specific context the user might have in mind, but since the question is straightforward, a general response should suffice. Let me make sure the response is structured well: start with who I am, then key features, and a friendly closing. Let me see... the user might be new to me, so keeping it simple and welcoming is key. Alright, time to put it all together in a natural, conversational tone.\n</think>\n\nI'm **Qwen**, a large language model developed by Tongyi Lab. I'm here to help with a wide range of tasks, including answering questions, writing stories, emails, scripts, logical reasoning, coding, and more. I can also express opinions and play games! Whether you need assistance with a problem or just want"}]}]

Hands-on

Test out different prompts in the hallucinations notebook (hallucinations.ipynb)

Try different prompts with Qwen 4B Thinking (thinking.ipynb)

(You may have to increase max_new_tokens in the pipeline constructor)

Reasoning/Thinking Models

“Tell me more about FilterGenie…”

Okay, the user is asking about FilterGenie, a macOS software for applying dynamic filters to images. Hmm, I need to recall what I know about this. Wait, I think FilterGenie might be a bit misleading here. Let me think carefully.\n\nFirst, I remember that there\'s a popular app called FilterGenie, but it\'s actually a Windows-only application. The user specifically mentioned it\'s for Mac. That\'s a red flag. I should double-check this to avoid giving wrong information. \n\nLet me mentally scan through common image editing apps for macOS. There\'s Lightroom, Affinity Photo, Photoshop, but those aren\'t called FilterGenie. Oh! I think there\'s a confusion with "FilterGenie" versus "FilterGenie" - wait, no. Actually, I recall that FilterGenie is a tool that\'s primarily for Windows. Maybe the user heard about it from a source that mixed up the platforms?\n\nI should consider if there\'s any Mac app with a similar name. Let me think... There\'s "Filter Magic" for Mac, but that\'s different. Or "Snapseed" by Google, but that\'s not it either. Hmm. \n\nWait a minute - I think the user might be mixing up FilterGen

Reasoning/Thinking Models

Prompt engineering and reasoning help the model answer more accurately from the data it was trained on
But they don’t help with data the model has never seen
Enter context injection…

Context Injection

A model’s training data will only answer so much
- Training data has a cut-off: Your model won’t know anything after that
- Your model won’t know if anything has changed since being trained
- Function calling can help bridge this gap, but context injection is often simpler and more flexible

Context Injection

How it works
- Take the user’s prompt (e.g., “Who teaches CS-394?”)
- Before sending the prompt to the model, do a database look-up to find relevant information (context)
- Inject that context into the system prompt
- Call the model with the modified system prompt
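
The four steps above can be sketched without a real database or model call. The dictionary below is a hypothetical stand-in for the database look-up, and `look_up_context` and `build_messages` are illustrative names.

```python
# Hypothetical in-memory "database"; a real app would query SQL or a search index
COURSE_INSTRUCTORS = {
    "CS-394": "Dr. Sarah Chen",
    "CS-101": "Prof. Marcus Webb",
}

def look_up_context(user_prompt):
    """Return facts for any known course code mentioned in the prompt."""
    facts = [
        f"{code} is taught by {name}."
        for code, name in COURSE_INSTRUCTORS.items()
        if code in user_prompt.upper()
    ]
    return "\n".join(facts) or "No matching course information found."

def build_messages(user_prompt):
    """Inject the looked-up context into the system prompt."""
    context = look_up_context(user_prompt)
    system_prompt = (
        "You help students look up course information.\n"
        "Answer using only the information below.\n\n"
        f"{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("Who teaches CS-394?")
print(messages[0]["content"])
```

The model call itself is unchanged; only the system prompt differs from the un-augmented version.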

Context Injection

The injected information is called “context”, and adding it is called “augmenting” the prompt
- Hence the popular term, augmented generation
- Two examples:
  - Text-to-SQL
  - Retrieval-Augmented Generation

Text-to-SQL

# Example question used throughout this section
USER_PROMPT = "Who teaches CS-394?"

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You help students look up course information."},
        {"role": "user", "content": USER_PROMPT},
    ],
)
response.choices[0].message.content

Text-to-SQL

"I don't have access to specific course details like instructors for CS-394, as that information is typically managed by educational institutions. To find out who teaches CS-394, you could:\n\n1. Check your university/department's official course schedule or syllabus.  \n2. Contact the course instructor or department directly via email or phone.  \n3. Use your school's learning management system (e.g., Moodle, Canvas) if the course is listed there.  \n\nLet me know if you'd help drafting a message to inquire!\n"

Text-to-SQL

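import sqlite3
import os

# Path taken from the sample output shown later in this section
DB_PATH = ".data/course_catalog.db"

def create_and_populate_db(db_path=DB_PATH):
    db_exists = os.path.exists(db_path)

    # Make sure the parent directory exists before sqlite3 creates the file
    os.makedirs(os.path.dirname(db_path) or ".", exist_ok=True)

    conn = sqlite3.connect(db_path)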
    conn.row_factory = sqlite3.Row
    cursor = conn.cursor()

    if db_exists:
        print(f"Found existing database at '{db_path}'. Skipping creation.")
        return conn

    print(f"Creating new database at '{db_path}'...")

    # --- Schema ---
    cursor.executescript("""
        CREATE TABLE instructors (
            id          INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            email       TEXT,
            department  TEXT,
            bio         TEXT
        );

        CREATE TABLE courses (
            id              INTEGER PRIMARY KEY,
            code            TEXT NOT NULL UNIQUE,
            title           TEXT NOT NULL,
            description     TEXT,
            credits         INTEGER,
            level           TEXT CHECK(level IN ('intro', 'intermediate', 'advanced')),
            instructor_id   INTEGER REFERENCES instructors(id),
            max_enrollment  INTEGER
        );

        CREATE TABLE schedules (
            id          INTEGER PRIMARY KEY,
            course_id   INTEGER REFERENCES courses(id),
            days        TEXT,
            time_start  TEXT,
            time_end    TEXT,
            room        TEXT,
            semester    TEXT
        );
    """)

    # --- Instructors ---
    cursor.executemany(
        "INSERT INTO instructors (name, email, department, bio) VALUES (?, ?, ?, ?)",
        [
            ("Dr. Sarah Chen", "s.chen@university.edu", "Computer Science",
             "Specializes in machine learning and NLP. Author of 'Practical Deep Learning'."),
            ("Prof. Marcus Webb", "m.webb@university.edu", "Computer Science",
             "Focuses on systems programming and computer architecture."),
            ("Dr. Priya Nair", "p.nair@university.edu", "Data Science",
             "Expert in statistical learning, data visualization, and reproducible research."),
            ("Prof. James Okafor", "j.okafor@university.edu", "Computer Science",
             "Teaches software engineering and has 10 years of industry experience at major tech firms."),
            ("Dr. Elena Russo", "e.russo@university.edu", "Mathematics",
             "Research interests include linear algebra, optimization, and mathematical foundations of AI."),
        ]
    )

    # --- Courses ---
    cursor.executemany(
        """INSERT INTO courses (code, title, description, credits, level, instructor_id, max_enrollment)
           VALUES (?, ?, ?, ?, ?, ?, ?)""",
        [
            ("CS-101", "Introduction to Programming",
             "Fundamentals of programming using Python. Covers variables, control flow, functions, and basic data structures.",
             3, "intro", 2, 40),
            ("CS-201", "Data Structures and Algorithms",
             "Core data structures including linked lists, trees, graphs, and hash tables. Algorithm design and complexity analysis.",
             3, "intermediate", 2, 35),
            ("CS-301", "Machine Learning Fundamentals",
             "Supervised and unsupervised learning, model evaluation, feature engineering, and scikit-learn. Final project required.",
             4, "intermediate", 1, 30),
            ("CS-394", "How Generative AI Works",
             "Transformer architectures, large language models, prompt engineering, fine-tuning, and deployment. Hands-on with open-source models.",
             3, "advanced", 1, 25),
            ("CS-310", "Database Systems",
             "Relational database design, SQL, transactions, indexing, and an introduction to NoSQL systems.",
             3, "intermediate", 2, 20)
        ]
    )

    # --- Schedules ---
    cursor.executemany(
        """INSERT INTO schedules (course_id, days, time_start, time_end, room, semester)
           VALUES (?, ?, ?, ?, ?, ?)""",
        [
            (1,  "Mon/Wed/Fri", "09:00", "09:50", "Room 101", "Spring 2026"),
            (2,  "Tue/Thu",     "10:00", "11:20", "Room 204", "Spring 2026"),
            (3,  "Mon/Wed",     "13:00", "14:20", "Lab 12",   "Spring 2026"),
            (4,  "Tue/Thu",     "14:00", "15:20", "Lab 12",   "Spring 2026"),
            (5,  "Mon/Wed/Fri", "11:00", "11:50", "Room 305", "Spring 2026"),
        ]
    )

    conn.commit()
    print("Database created and populated successfully.")
    print(f"  - {cursor.execute('SELECT COUNT(*) FROM instructors').fetchone()[0]} instructors")
    print(f"  - {cursor.execute('SELECT COUNT(*) FROM courses').fetchone()[0]} courses")
    print(f"  - {cursor.execute('SELECT COUNT(*) FROM schedules').fetchone()[0]} schedules")

    return conn

create_and_populate_db()

Text-to-SQL

import sqlite3
import re

def get_course_by_code(conn, code):
    """Return full details for a single course by its code (e.g. 'CS-394')."""
    return conn.execute("""
        SELECT
            c.code,
            c.title,
            c.credits,
            c.level,
            c.description,
            c.max_enrollment,
            i.name AS instructor,
            i.email AS instructor_email,
            i.bio AS instructor_bio,
            s.days,
            s.time_start,
            s.time_end,
            s.room,
            s.semester
        FROM courses c
        JOIN instructors i ON c.instructor_id = i.id
        JOIN schedules s   ON s.course_id = c.id
        WHERE UPPER(c.code) = UPPER(?)
    """, (code,)).fetchone()



def rows_to_text(rows):
    """Convert sqlite3.Row results into a readable string block."""
    if not rows:
        return "No results found."
    if isinstance(rows, sqlite3.Row):
        rows = [rows]
    return "\n".join(
        "  " + ", ".join(f"{k}: {row[k]}" for k in row.keys())
        for row in rows
    )


def build_context(conn, user_query):
    """
    Extract a course code from the user query and retrieve its details.
    Returns a formatted string ready to inject into the system prompt.
    """
    code_match = re.search(r'\b([A-Z]{2,4}-\d{3})\b', user_query.upper())

    if not code_match:
        return "No course code found in query. Please include a course code (e.g. CS-394)."

    code = code_match.group(1)
    course = get_course_by_code(conn, code)

    if not course:
        return f"No course found with code {code}."

    return (
        f"Course details for {code}:\n{rows_to_text(course)}\n\n"
    )
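
The regular expression used in build_context can be tried on its own to see which queries it catches. The sample queries below are illustrative, and `extract_course_code` is a hypothetical wrapper around the same pattern.

```python
import re

# Same pattern as build_context: 2-4 letters, a hyphen, then 3 digits
COURSE_CODE = re.compile(r"\b([A-Z]{2,4}-\d{3})\b")

def extract_course_code(user_query):
    """Return the first course code in the query, or None."""
    match = COURSE_CODE.search(user_query.upper())
    return match.group(1) if match else None

for query in [
    "Who teaches CS-394?",
    "when is cs-101 held?",
    "What courses teach generative AI?",
    "Is CS 394 full?",
]:
    print(f"{query!r} -> {extract_course_code(query)}")
```

The last two queries return None: one has no course code at all, and the other writes the code with a space instead of a hyphen.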

Text-to-SQL

# Connect to the sqlite database
conn = create_and_populate_db()

# Query the database based on the user prompt
context = build_context(conn, USER_PROMPT)

# Create system prompt
SYSTEM_PROMPT = f"""
You help students look up course information. Here are the course details:

---COURSE INFORMATION---
{context}
---END COURSE INFORMATION---
"""

# Query the model
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
response.choices[0].message.content

Text-to-SQL

Found existing database at '.data/course_catalog.db'. Skipping creation.

'The instructor for CS-394 "How Generative AI Works" is **Dr. Sarah Chen**.\n'

Text-to-SQL

Improved results, but…
- Requires the course code to be in the user prompt, where a regular expression can match it
- Will answer “Who teaches CS-394?” and “When is CS-394 held?”
- Won’t answer broader queries:
  - “What courses teach generative AI?”
  - “Which courses are on Friday afternoons?”
- Could make a free-text search through the database…
- But there’s a better way…
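
The free-text search mentioned above could be a simple `LIKE` query. The sketch below uses an in-memory table holding two abbreviated rows from the course catalog; `keyword_search` is a hypothetical helper. It also shows where keyword matching falls short.

```python
import sqlite3

def keyword_search(conn, keyword):
    """Return course codes whose title or description contains the keyword."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        """SELECT code FROM courses
           WHERE title LIKE ? OR description LIKE ?
           ORDER BY code""",
        (pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]

# Small in-memory copy of two catalog rows (descriptions abbreviated)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (code TEXT, title TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO courses VALUES (?, ?, ?)",
    [
        ("CS-394", "How Generative AI Works",
         "Transformer architectures, large language models, prompt engineering."),
        ("CS-310", "Database Systems",
         "Relational database design, SQL, transactions, indexing."),
    ],
)

print(keyword_search(conn, "generative AI"))  # literal phrase appears in a title
print(keyword_search(conn, "LLMs"))           # same topic, different wording
```

The second query finds nothing even though CS-394 covers large language models, because `LIKE` matches characters rather than meaning.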

RAG: Retrieval-Augmented Generation

The roots of RAG trace back to the 1950s and 60s, when information-retrieval researchers began working with vector space models
The term was coined in 2020 in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al. 2020)
“We definitely would have put more thought into the name had we known our work would become so widespread” (Patrick Lewis, lead author)

How RAG Works

Create “documents”
- Strings of text - can be from existing db queries or scraped from PDFs
- Generate embeddings (using a Sentence Transformer)
- Store embeddings in a database
The user prompt is converted into an embedding
Find the stored embeddings closest to it and inject the matching documents into the system prompt
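
The retrieval step can be sketched in pure Python. To stay self-contained, the example swaps the Sentence Transformer for a toy word-count embedding (`toy_embed`, an assumption made for illustration only); the cosine-similarity ranking and the injection into the system prompt follow the steps above.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: a sparse word-count vector (NOT a Sentence Transformer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    query_vec = toy_embed(query)
    scored = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, toy_embed(doc)),
        reverse=True,
    )
    return scored[:top_k]

documents = [
    "How Generative AI Works: transformer architectures and large language models",
    "Database Systems: relational database design, SQL, and indexing",
    "Introduction to Programming: fundamentals of programming using Python",
]

context = retrieve("Which course covers large language models?", documents)
system_prompt = "You help students look up course information.\n\n" + "\n".join(context)
print(system_prompt)
```

A word-count vector only matches shared words; a real embedding model places texts with similar meaning close together even when the wording differs, which is what makes RAG more flexible than keyword search.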

How RAG Works

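import os
import sqlite3

import numpy as np
import sqlite_vec
from sentence_transformers import SentenceTransformer

# Assumption: the slides don't name the embedding model; any model that
# outputs 384-dimensional vectors matches the FLOAT[384] column below
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Assumption: hypothetical path, kept separate from the Text-to-SQL database
# so the embeddings table is created on first run
DB_PATH = ".data/course_catalog_rag.db"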

def create_connection(db_path=DB_PATH):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    conn.enable_load_extension(True)
    sqlite_vec.load(conn)
    conn.enable_load_extension(False)
    return conn

def create_and_populate_db(db_path=DB_PATH):
    db_exists = os.path.exists(db_path)

    conn = create_connection(db_path)
    cursor = conn.cursor()

    if db_exists:
        print(f"Found existing database at '{db_path}'. Skipping creation.")
        return conn

    print(f"Creating new database at '{db_path}'...")

    # --- Schema ---
    cursor.executescript("""
        CREATE TABLE instructors (
            id          INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            email       TEXT,
            department  TEXT,
            bio         TEXT
        );

        CREATE TABLE courses (
            id              INTEGER PRIMARY KEY,
            code            TEXT NOT NULL UNIQUE,
            title           TEXT NOT NULL,
            description     TEXT,
            credits         INTEGER,
            level           TEXT CHECK(level IN ('intro', 'intermediate', 'advanced')),
            instructor_id   INTEGER REFERENCES instructors(id),
            max_enrollment  INTEGER
        );

        CREATE TABLE schedules (
            id          INTEGER PRIMARY KEY,
            course_id   INTEGER REFERENCES courses(id),
            days        TEXT,
            time_start  TEXT,
            time_end    TEXT,
            room        TEXT,
            semester    TEXT
        );

        CREATE VIRTUAL TABLE course_embeddings USING vec0(
            course_id INTEGER PRIMARY KEY,
            embedding FLOAT[384]
        );
    """)

    # --- Instructors ---
    cursor.executemany(
        "INSERT INTO instructors (name, email, department, bio) VALUES (?, ?, ?, ?)",
        [
            ("Dr. Sarah Chen", "s.chen@university.edu", "Computer Science",
             "Specializes in machine learning and NLP. Author of 'Practical Deep Learning'."),
            ("Prof. Marcus Webb", "m.webb@university.edu", "Computer Science",
             "Focuses on systems programming and computer architecture."),
            ("Dr. Priya Nair", "p.nair@university.edu", "Data Science",
             "Expert in statistical learning, data visualization, and reproducible research."),
            ("Prof. James Okafor", "j.okafor@university.edu", "Computer Science",
             "Teaches software engineering and has 10 years of industry experience at major tech firms."),
            ("Dr. Elena Russo", "e.russo@university.edu", "Mathematics",
             "Research interests include linear algebra, optimization, and mathematical foundations of AI."),
        ]
    )

    # --- Courses ---
    cursor.executemany(
        """INSERT INTO courses (code, title, description, credits, level, instructor_id, max_enrollment)
           VALUES (?, ?, ?, ?, ?, ?, ?)""",
        [
            ("CS-101", "Introduction to Programming",
             "Fundamentals of programming using Python. Covers variables, control flow, functions, and basic data structures.",
             3, "intro", 2, 40),
            ("CS-201", "Data Structures and Algorithms",
             "Core data structures including linked lists, trees, graphs, and hash tables. Algorithm design and complexity analysis.",
             3, "intermediate", 2, 35),
            ("CS-301", "Machine Learning Fundamentals",
             "Supervised and unsupervised learning, model evaluation, feature engineering, and scikit-learn. Final project required.",
             4, "intermediate", 1, 30),
            ("CS-394", "How Generative AI Works",
             "Transformer architectures, large language models, prompt engineering, fine-tuning, and deployment. Hands-on with open-source models.",
             3, "advanced", 1, 25),
            ("CS-310", "Database Systems",
             "Relational database design, SQL, transactions, indexing, and an introduction to NoSQL systems.",
             3, "intermediate", 4, 20),
        ]
    )

    # --- Schedules ---
    cursor.executemany(
        """INSERT INTO schedules (course_id, days, time_start, time_end, room, semester)
           VALUES (?, ?, ?, ?, ?, ?)""",
        [
            (1, "Mon/Wed/Fri", "09:00", "09:50", "Room 101", "Spring 2026"),
            (2, "Tue/Thu",     "10:00", "11:20", "Room 204", "Spring 2026"),
            (3, "Mon/Wed",     "13:00", "14:20", "Lab 12",   "Spring 2026"),
            (4, "Tue/Thu",     "14:00", "15:20", "Lab 12",   "Spring 2026"),
            (5, "Mon/Wed/Fri", "11:00", "11:50", "Room 305", "Spring 2026"),
        ]
    )

    conn.commit()

    # --- Embeddings ---
    # Embed each course as "title: description" so the vector captures both
    print("Generating embeddings...")
    courses = conn.execute("SELECT id, title, description FROM courses").fetchall()
    for course in courses:
        text = f"{course['title']}: {course['description']}"
        embedding = embedder.encode(text).astype(np.float32)
        cursor.execute(
            "INSERT INTO course_embeddings (course_id, embedding) VALUES (?, ?)",
            (course['id'], embedding.tobytes())
        )

    conn.commit()
    print("Database created and populated successfully.")
    print(f"  - {cursor.execute('SELECT COUNT(*) FROM instructors').fetchone()[0]} instructors")
    print(f"  - {cursor.execute('SELECT COUNT(*) FROM courses').fetchone()[0]} courses")
    print(f"  - {cursor.execute('SELECT COUNT(*) FROM schedules').fetchone()[0]} schedules")
    print(f"  - {cursor.execute('SELECT COUNT(*) FROM course_embeddings').fetchone()[0]} embeddings")

    return conn

conn = create_and_populate_db()

How RAG Works

def search_courses(conn, user_query, top_k=3):
    """
    Embed the user query and find the most semantically similar courses.
    Returns the top_k closest courses with full details.
    """
    query_embedding = embedder.encode(user_query).astype(np.float32)

    return conn.execute("""
        SELECT
            c.code,
            c.title,
            c.description,
            c.credits,
            c.level,
            c.max_enrollment,
            i.name AS instructor,
            i.email AS instructor_email,
            s.days,
            s.time_start,
            s.time_end,
            s.room,
            s.semester,
            e.distance
        FROM course_embeddings e
        JOIN courses c    ON e.course_id = c.id
        JOIN instructors i ON c.instructor_id = i.id
        JOIN schedules s   ON s.course_id = c.id
        WHERE e.embedding MATCH ?
          AND k = ?
        ORDER BY e.distance
    """, (query_embedding.tobytes(), top_k)).fetchall()


def rows_to_text(rows):
    if not rows:
        return "No results found."
    if isinstance(rows, sqlite3.Row):
        rows = [rows]
    return "\n".join(
        "  " + ", ".join(f"{k}: {row[k]}" for k in row.keys())
        for row in rows
    )


def build_context(conn, user_query):
    rows = search_courses(conn, user_query)
    return f"Most relevant courses for the query:\n{rows_to_text(rows)}"

How RAG Works

# Connect to the sqlite database
conn = create_and_populate_db()

# Query the database based on the user prompt
context = build_context(conn, USER_PROMPT)

# Create system prompt
SYSTEM_PROMPT = f"""
You help students look up course information. Here are the course details:

---COURSE INFORMATION---
{context}
---END COURSE INFORMATION---
"""

# Query the model
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
response.choices[0].message.content

How RAG Works

“Which courses teach about generative AI?”

How RAG Works

Found existing database at '.data/course_catalog_rag.db'. Skipping creation.

'The course that specifically teaches about generative AI is:\n\n**CS-394: How Generative AI Works**  \n- **Description**: Covers transformer architectures, large language models, prompt engineering, fine-tuning, and deployment. Includes hands-on work with open-source models.  \n- **Details**:  \n  - Credits: 3 | Level: Advanced  \n  - Instructor: Dr. Sarah Chen | [Contact](mailto:s.chen@university.edu)  \n  - Time: Tue/Thu 14:00–15:20 | Room: Lab 12  \n  - Semester: Spring 2026  \n\nThis course directly addresses generative AI concepts and techniques.\n'

Hands-on

Investigate text-to-sql.ipynb and rag.ipynb

Try different queries - e.g., ask for courses on particular days or in particular locations

RAG Chunking

Large documents will require splitting (a.k.a. chunking)
How documents are split before embedding has a major impact on retrieval quality
- Fixed-size: Split every N tokens, with optional overlap. Simple and predictable, but can cut mid-sentence.
- Sentence/paragraph-aware: Split on natural boundaries (sentences, paragraphs). Better coherence, though chunk sizes will vary.
- Semantic chunking: Group sentences by embedding similarity, splitting where meaning shifts. Highest quality, but more complex.
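The first two strategies above can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the function names `chunk_fixed` and `chunk_sentences` are hypothetical, whitespace-separated words stand in for model tokens, and the sentence splitter is a simple regex that will miss edge cases such as abbreviations.

```python
import re


def chunk_fixed(tokens, size, overlap=0):
    """Split a token list into windows of `size` tokens.

    Each window shares `overlap` tokens with the previous one.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


def chunk_sentences(text, max_words):
    """Greedily pack whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` becomes its own oversized chunk.
    """
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Fixed-size chunks are all the same length but may cut a sentence in half. The sentence-aware version keeps sentences whole at the cost of uneven chunk sizes.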

RAG for Photo Indexing

RAG can also be used for photo/image indexing
Documents can be created using VLMs (Vision Language Models)
- Descriptions from VLMs (together with photo metadata) can then be used for lookup
- “Find me photos with two dogs next to the beach”
- “Show me all my photos of sunsets in Hawaii”

RAG for Photo Indexing

RAG Databases

Extensions
- sqlite-vec: SQLite Extension (we’ve been using that!)
- pgvector: PostgreSQL extension
Libraries
- FAISS (from Meta): An in-memory vector index rather than a database. Very fast.
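To make "vector index" concrete, here is a rough sketch of the simplest possible in-memory index: an exhaustive search that scores every stored vector against the query. It uses plain Python and cosine similarity as assumptions for illustration. It is not how sqlite-vec or FAISS are implemented, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query, index, k=3):
    """Return the k (doc_id, score) pairs most similar to `query`.

    `index` is a dict mapping a document id to its embedding vector.
    """
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Exhaustive search like this is exact but scales linearly with the number of vectors, which is why dedicated libraries and databases offer approximate indexes for larger collections.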

RAG Databases

OpenAI: Vector Stores (we used this in Module 3 for FileSearch)
Pinecone: Popular commercial managed/cloud option
Qdrant: Open-source dedicated db, written in Rust
Milvus: Open-source. Heavier to operate, but can exceed a billion embeddings

Side Topic: Evaluating Models

Evaluating Models

How do we know whether we are improving a model’s accuracy?
- Run evaluations (commonly known as “evals”)
- Allows us to benchmark progress, especially for fine-tuning
- We also want to know if performance has regressed (degraded) on certain tasks
- Evals also help grade models within certain domains (e.g., what is the SOTA model for math, coding, medical, law, etc.)

Evaluating Models

How evals are created:
- Curated question sets: Experts write or gather questions based on ground truth answers
- Human-validated responses: For open-ended tasks (e.g., coding) human evaluators score model responses - or models are tested against validation logic (e.g., unit tests)
- Adversarial design: The best evals try to anticipate how models might “cheat” - e.g., avoiding questions in training data, including distractors and edge cases

Popular Evals: MMLU Pro

https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro

What it tests: 12K questions across 14 subject areas that cover STEM, humanities, social sciences, and professional domains.
History: Original MMLU published in 2021 by Dan Hendrycks et al.
- MMLU-Pro created because models were plateauing on MMLU (hitting 85-90%+)
- Pro released in June 2024 by the TIGER-AI-Lab team and accepted at NeurIPS 2024

Popular Evals: GPQA

https://huggingface.co/datasets/Idavidrein/gpqa

What it tests: Graduate/PhD-level scientific reasoning in biology, physics, and chemistry through multiple-choice questions
- Require deep domain understanding, not just fact recall
- Cannot be solved by web search (the “Google-proof” property)
History:
- Created Nov 2023 by David Rein et al. (includes researchers from NYU, Anthropic, and other institutions)

Popular Evals: SWE-Bench

https://www.swebench.com/

What it tests: AI systems’ ability to solve real-world software engineering tasks by resolving actual GitHub issues from popular open-source Python repositories
History: Created in 2023 by the Princeton NLP group (Carlos E. Jimenez et al.)
- Original dataset was 2,294 GitHub issues from 12 popular Python repositories
- Three major variants: Lite, Verified, and Pro
- Microsoft also publishes a “Live” version with monthly curated updates

Popular Evals: HLE

https://huggingface.co/datasets/cais/hle

What it tests: 2,500 expert-level questions across dozens of subjects, including mathematics, humanities, and the natural sciences.
- Require multimodal, multi-step reasoning rather than pattern matching or recall
History: Created in late 2024 by the Center for AI Safety (CAIS) and Scale AI
- Led by: Dan Hendrycks (who also created MMLU and MATH benchmarks)

Challenges With Evals

Dataset contamination
- Many popular benchmarks (MMLU, HumanEval) have been around for years and “leaked” into training data
- Models can achieve a high score by memorizing the questions and answers rather than by reasoning
Gaming the benchmark
- Researchers may optimize models specifically for benchmark performance without improving general capability - also known as “teaching to the test”
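For the dataset contamination point above, one common heuristic is to look for long word sequences (n-grams) that an eval question shares with training text. The sketch below is a simplified illustration under stated assumptions: the function names are hypothetical, the default of 8 words is arbitrary, and real contamination checks run over far larger corpora with more careful text normalization.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(question, corpus_text, n=8):
    """Fraction of the question's n-grams that also appear in the corpus text.

    Returns 0.0 when the question is shorter than n words.
    """
    question_ngrams = ngrams(question, n)
    if not question_ngrams:
        return 0.0
    return len(question_ngrams & ngrams(corpus_text, n)) / len(question_ngrams)
```

A ratio near 1.0 suggests the question appears almost verbatim in the corpus. A low ratio does not prove the question is clean, since paraphrased copies share few exact n-grams.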

Challenges With Evals

Synthetic vs. human-generated problems
- Many evals use clean, well-formatted, multiple-choice problems.
- Real users lean toward more open-ended questions, which may have typos, missing context, and other edge cases.
Bias and fairness
- Standard evals rarely measure performance across demographics - a model might excel on average, but fail for under-represented groups or languages.

What’s Next For Evals?

Agentic and multi-step workflows: Moving beyond single-question tests to longer tasks like research, multi-file codebases, and customer support conversations
Overcoming data contamination: Hugging Face’s Benchmarks, along with dynamic benchmarks that refresh regularly (like SWE-Bench Live) or generate novel questions programmatically, aim to avoid training data leakage
Process evaluation over outcomes: Testing how models reason, not just final answers — can the model self-correct, identify its mistakes, and explain its reasoning?

Fine-tuning

Prompt engineering, using thinking models, and context injection (RAG) can help improve the output and accuracy of models
However, sometimes you need the model to have more consistency in its style, structure, or behavior
Enter fine-tuning…

Fine-tuning

Fine-tuning is the process of taking a pre-trained model and continuing to train it on a smaller, task-specific dataset using curated examples
This allows it to adapt its general capabilities toward a particular style, structure, or behavior without learning from scratch

Fine-tuning

Examples
- Style: Adopting a different voice at scale: Speak to students using “Hinglish” at a 5th grade level
- Structure: Generating outputs in a precise format: Player movement instructions in a game, in JSON format
- Behavior: Always refuse off-topic questions and redirect to the subject matter (guardrail)

History of Fine-tuning

2010s: Fine-tuning became prevalent in computer vision (e.g., ImageNet-trained CNNs adapted for medical imaging)
2018: BERT demonstrated that a language model could be fine-tuned for sentiment, Q&A, and classification
2022: InstructGPT applied fine-tuning on human-curated instruction-response pairs
Today: Very popular technique for adapting trained models to different use cases

Generating Training Data

Before we can fine-tune a model, we need to generate training data
- Data are conversations or input/output pairs that demonstrate the style, structure, or behavior you want the model to adopt

Generating Training Data

{"messages": [{"role": "user", "content": "s = \"Hello\\nWorld\"\nprint(s)\n"}, {"role": "assistant", "content": "The code defines a string that contains the escape sequence \\n, which tells Python to start a new line when the string is displayed. When you run the print statement, \"Hello\" appears on one line and \"World\" appears on the next line, just like pressing Enter moves the cursor down in a text editor.\n\nImagine you are writing a story on a typewriter. When you finish a line, you press the \u201cEnter\u201d key to move to the next line and continue typing. The \\n character works the same way for strings in Python, signaling that the following text should start on a new line."}]}

Generating Training Data

Three datasets
- Training set: What the model will learn from. Largest portion, about 70-80% of total data.
- Validation set: Separate dataset, used to monitor how well the model is generalizing. About 10-15%.
- Test set: Separate dataset, provides an unbiased final measure of performance after training. About 10-15%.
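A split along these lines can be sketched as a shuffle followed by slicing. The 80/10/10 defaults below are one choice within the ranges above, and `split_dataset` is a hypothetical helper, not part of any library.

```python
import random


def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle examples and split them into (train, validation, test) lists.

    Whatever remains after the train and validation portions goes to test.
    A fixed seed keeps the split reproducible between runs.
    """
    if not (0 < train < 1 and 0 <= val < 1 and train + val < 1):
        raise ValueError("train and val must be fractions that leave room for a test set")
    shuffled = list(examples)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Shuffling before slicing matters when the examples were generated in batches by topic, since an unshuffled split could leave a whole topic out of the training set.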

Generating Training Data

Diversity of training data
- Needs to cover the range of inputs that the model will encounter in production.
- A model trained only on short questions will struggle when users ask anything more complex
- Datasets should vary along several dimensions

Generating Training Data

Diversity dimensions
- Topics (areas, sub-domains)
- Audience (e.g., school grade)
- Length (e.g., short, med, long)
- Formats (e.g., script vs. romanized)
- Conversation turns (single / multi)
- Negative answers (e.g., if implementing safety)
Each of these can be weighted (e.g., 60% short, 20% med, 20% long)
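One way to apply these weights is to sample a "spec" for each example before asking a model to generate it. The sketch below is an assumption about how that could look: the 60/20/20 length weights come from the slide, while the `turns` weights and the names `DIMENSIONS`, `sample_spec`, and `sample_specs` are made up for illustration.

```python
import random

# Each dimension maps its possible values to a sampling weight.
DIMENSIONS = {
    "length": {"short": 0.6, "med": 0.2, "long": 0.2},
    "turns": {"single": 0.7, "multi": 0.3},  # hypothetical weights
}


def sample_spec(dimensions, rng):
    """Pick one value per dimension according to its weights."""
    spec = {}
    for name, weights in dimensions.items():
        values = list(weights)
        spec[name] = rng.choices(values, weights=[weights[v] for v in values], k=1)[0]
    return spec


def sample_specs(dimensions, n, seed=0):
    """Sample n specs reproducibly, one per training example to generate."""
    rng = random.Random(seed)
    return [sample_spec(dimensions, rng) for _ in range(n)]
```

Each spec can then be turned into a generation prompt (for example, "a short, single-turn question"), so the finished dataset roughly follows the intended mix.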

Generating Training Data

How much data do you need? It depends :)
Behavior and style shifts: 50-500 high-quality examples
Multi-turn dialogue: 1K-10K examples
Smaller models often need more examples
Optimal dataset size is often discovered during training

Generating Training Data

Dataset Format
- Typically JSON Lines (jsonl): one JSON object per line
- Converted to Hugging Face Datasets (and uploaded)
- During training, these get converted to the chat template/format used by the model
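Reading and writing JSON Lines needs only the standard `json` module. In this minimal sketch the helper names `dumps_jsonl` and `loads_jsonl` are hypothetical, and they work on strings so the example stays self-contained. Writing the string to a `.jsonl` file is a one-line addition.

```python
import json


def dumps_jsonl(records):
    """Serialize a list of dicts to JSON Lines text, one object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


def loads_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

A file in this format can typically be loaded into a Hugging Face dataset with the `datasets` library's JSON loader, for example `load_dataset("json", data_files="train.jsonl")`, where the file name is a placeholder.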

Generating Training Data

How do we generate this data?
- Create the file by hand!
- Generate synthetic data from a more capable model (and review for quality)
- Known as “data distillation”
Synthetic Data
- “Generate n examples of students asking their teacher a geography question”
- Often use structured outputs to map to jsonl conversational format
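The mapping from a model's structured output to the conversational jsonl format might look like the sketch below. It is an illustration under assumptions, not the `generate-synthetic.ipynb` notebook's code: `build_prompt` and `to_chat_records` are hypothetical helpers, the `question`/`answer` field names are an assumed schema, and the call to the more capable model is left out.

```python
import json


def build_prompt(n, scenario):
    """Build a generation prompt that asks for machine-readable output."""
    return (
        f"Generate {n} examples of {scenario}. "
        'Respond with a JSON array of objects, each with "question" and "answer" string fields.'
    )


def to_chat_records(raw_json):
    """Validate the model's JSON output and map each item to the chat format.

    Items that are malformed or have empty fields are skipped rather than repaired,
    so they can be regenerated or reviewed by hand.
    """
    records = []
    for item in json.loads(raw_json):
        if not isinstance(item, dict):
            continue
        question, answer = item.get("question"), item.get("answer")
        if not (isinstance(question, str) and isinstance(answer, str)):
            continue
        if not question.strip() or not answer.strip():
            continue
        records.append({"messages": [
            {"role": "user", "content": question.strip()},
            {"role": "assistant", "content": answer.strip()},
        ]})
    return records
```

Each returned record has the same `messages` shape as the training example shown earlier, so the list can be written out one JSON object per line.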

Demo

Synthetic data generation using the generate-synthetic.ipynb notebook

Looking Ahead

This week’s assignment!
Use your generated synthetic data to fine-tune a small model
Use W&B (Weights and Biases) to observe parameters during the training run
Understand and create a model card, upload your model to Hugging Face and share

References

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33: 9459–74. https://arxiv.org/abs/2005.11401.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35: 24824–37. https://arxiv.org/abs/2201.11903.
