Module 7: Increasing Model Accuracy (Part 2)

Recap

Understood what leads to hallucinations in models, how models are evaluated, and an overview of techniques to increase accuracy
Explored prompt engineering and thinking models
Introduced and implemented Text-to-SQL and RAG (Retrieval-Augmented Generation) to increase the accuracy of a limited SLM
Started exploring model fine-tuning
Generated synthetic data for fine-tuning a small language model

Lesson Objectives

Use your generated synthetic data to fine-tune an SLM using QLoRA
Use W&B (Weights & Biases) to observe metrics during the training run
Post-training, test and evaluate the accuracy of your fine-tuned model
Merge, quantize, and upload your model to Hugging Face to share with others
Create a model card for your newly fine-tuned model

Model Training Stages

There are three main stages of training a model:
- Pretraining: The initial, large-scale training run that creates a base model with pretrained weights
- Supervised Fine-Tuning (SFT): Shapes the base model to follow instructions and respond in a desired style or format
- Alignment: Further refines the model toward preferred behaviors and values

Model Training Stages

Many models (e.g., Qwen3-1.7B) have already gone through these three stages
- We will be adding a fourth stage. An SFT of an existing “polished” model
Why?
- The model already knows how to follow instructions, answer questions, etc.
- They need far less training data to get good results
- Training tends to be faster and more stable since they are starting from a good baseline

Model Training Stages

Supervised Fine-Tuning (SFT)

How does SFT work?
- Build training and validation datasets (we did this already!)
- Feed the prompts into the model and compare the desired output tokens with the model’s predictions
- The difference between the two is known as the training loss

Supervised Fine-Tuning (SFT)

How does SFT work?
- Back propagate through the model, nudging the weights to guide toward the desired output
- This “nudging” is done using an optimizer (typically AdamW)
- Repeat for typically 1-3 epochs (epoch = complete pass through the dataset)
- Monitor validation loss and checkpoint the model at regular intervals

Supervised Fine-Tuning (SFT)

Training Loss tells us how close the model predicts vs. our desired outputs
Validation Loss tells us how well the model predicts on data that it hasn’t been trained on
- Prevents the model from memorizing the training data vs. generalizing
- “Memorizing training data” known as overfitting

Introducing LoRA

Adjusting all parameters in a model can be compute and memory intensive
Introducing LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)

Introducing LoRA

How LoRA works:
- The model’s base weights (W) are frozen (never updated) during training
- Two small matrices (A and B) are introduced, whose product represents the desired change to W
- Only A and B are updated during training — a tiny fraction of the total parameters
- After training, these matrices are known as an “adapter”
- The adapter can be kept separate (swappable) or merged back into W to create a new, standalone model

Introducing LoRA

Introducing QLoRA

To further increase memory efficiency, we will use QLoRA (Quantized LoRA)
- Introducing QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al. 2023)
- The base model weights (W) are quantized to 4-bit (NF4 format) for storage
- During the forward pass, W is dequantized to bf16 on the fly for computation, then discarded
- Adapter matrices A and B are kept at full precision (bf16) for training accuracy

Introducing QLoRA

Our Fine-Tuning Plan

Our Fine-Tuning Plan (1/2)

Upload our .jsonl files to Hugging Face to create a dataset
In our training notebook…
- Download the dataset
- Format our training data to match the chat template
  - i.e., convert JSON to <|imstart|>user... etc.
- Start our training run!
Use W&B (Weights & Biases) to monitor how well our model is learning
- Training loss and validation loss

Our Fine-Tuning Plan (2/2)

After training has completed…
- Test (both manually and automatic)
Assuming things went well…
- Merge and upload our model to Hugging Face
- Quantize our model (create a GGUF) and upload to Hugging Face
- Download and test our model in LM Studio
Phew!

Upload Training Files

Upload our train.jsonl, validation.jsonl, and test.jsonl files to a Hugging Face Dataset
Why do this?
- Single repository
- Easy to hook into training libraries
- Shareable with others
- Better UI and query semantics vs. raw .jsonl files

Upload Training Files

from datasets import Dataset, DatasetDict
import json


def load_jsonl(file_path):
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:  # Skip empty lines
                continue
            try:
                # Try parsing the line
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Warning: Error parsing line {line_num}: {e}")
                print(f"Problematic line: {line[:200]}...")
    return data


def create_hf_dataset(train_file, val_file, test_file):
    # Load the data
    train_data = load_jsonl(train_file)
    val_data = load_jsonl(val_file)
    test_data = load_jsonl(test_file)

    # Create datasets
    train_dataset = Dataset.from_list(train_data)
    val_dataset = Dataset.from_list(val_data)
    test_dataset = Dataset.from_list(test_data)

    # Combine into DatasetDict
    dataset_dict = DatasetDict(
        {
            "train": train_dataset,
            "validation": val_dataset,
            "test": test_dataset,
        }
    )

    return dataset_dict


# Create and validate the dataset is ready to upload
dataset = create_hf_dataset(TRAIN_FILE, VALIDATION_FILE, TEST_FILE)
print(f"Train samples: {len(dataset['train'])}")
print(f"Validation samples: {len(dataset['validation'])}")
print(f"Test samples: {len(dataset['test'])}")
print(f"\nSample entry: {dataset['train'][0]}")

Upload Training Files

def upload_dataset(dataset, repo_name, token):
    dataset.push_to_hub(
        repo_name,
        token=token,
        private=False
    )
    
    print(f"Dataset uploaded successfully to: https://huggingface.co/datasets/{repo_name}")

upload_dataset(dataset, DATASET_REPO, os.environ.get("HF_TOKEN"))

Demo

Uploading the training data files to Hugging Face Dataset

Upload Training Files

After you’ve uploaded
- It will take a few minutes before you can browse your data
- About 24-48 hours, you will likely receive an email about conversion to parquet format

Hands-on

Upload your training data files to Hugging Face Dataset

(CPU instance on Colab is fine for this)

The Training Notebook

train-cuda.ipynb contains the steps to fine-tune a small model
Main sections
- Set training parameters
- Download datasets from Hugging Face
- Create formatted dataset that uses the chat template
- Load base model and configure for QLoRA
- Create our trainer
- Train!

Demo

Walkthrough of the training notebook

Set parameters and start training!

Training Hyperparameters

BATCH_SIZE = 4

Number of training examples processed together in a single forward/backward pass
Larger batches = more stable gradient estimates, but consume more memory
2-8 recommended

Training Hyperparameters

GRADIENT_ACCUMULATION_STEPS = 4

Instead of updating weights after every batch, accumulate over this many batches before taking an optimizer step
Memory-efficient way to increase the batch size
2-16 recommended

Training Hyperparameters

LEARNING_RATE = 2e-4 (or 0.0002)

How large a step the optimizer takes when updating weights
Too high = training becomes unstable
Too low = training is too slow or gets stuck
1e-4 or 2e-4 are common starting points

Training Hyperparameters

NUM_EPOCHS = 3

Number of times the model sees the entire dataset
More epochs can improve training, but with the risk of overfitting
2-5 epochs for a few thousand examples recommended

Training Hyperparameters

MAX_SEQ_LENGTH = 512

The maximum number of tokens in a single training example
(Length of one of the lines in your jsonl file)
Should cover your longest examples without too much padding

Training Hyperparameters

LORA_R = 16

The rank of the adapter matrices - i.e., how many parameters are added during fine-tuning
Higher rank = more capacity, but also more memory and risk of overfitting
- e.g., 4096 x 4096 weight matrix = 16M parameters
- A = 4096x16; B=16x4096; Total = 131K parameters to be trained
4, 8, 16, 32 common values; 8-16 for small models

Training Hyperparameters

LORA_ALPHA = 32

A scaling factor for the LoRA updates
Usually set to 2x the rank (LORA_R)

Training Hyperparameters

LORA_DROPOUT = 0.05

Randomly zero out a small fraction of the adapter activations during training
Regularization technique to prevent overfitting
0.0 to 0.1 common range; use lower values for very small datasets

Training Hyperparameters

USE_4BIT = True

Enables QLoRA by loading the base model weights (W) in 4-bit precision
Optional for training 1B models in 8Gb VRAM
Definitely needed for 7B and 13B fine-tuning runs

Demo

Let’s check on our training!

Weights & Biases

What is Weights & Biases (W&B)?
- Founded in 2018 after founders noticed that researchers were keeping hyperparameters and training run details in spreadsheets / on paper, etc.
- Stream any metric (accuracy, loss, GPU utilization) to a single dashboard; compare training runs side-by-side to improve model accuracy
- Used either standalone via wandb.log() calls and/or integrated into many training libraries
- Generous free tier and education/researcher plans

Weights & Biases

System Metrics
- Metrics on the health of the underlying hardware used to train the model
- Provided automatically via the W&B library
- Useful for hardware capacity monitoring, especially if the training run crashes
What to look for:
- Stability - i.e., not running out of VRAM, disk space, etc.

Weights & Biases

Weights & Biases

Training Metrics
- Metrics on the performance of the training process (how accurate are the output tokens matching the training data)
- Number of tokens, token accuracy, training loss, grad_norm, etc.
What to look for:
- Token accuracy should increase quickly and plateau
- Training loss should decrease quickly and plateau
- Grad norm should not have drastic spikes (e.g., to 100)

Weights & Biases

Weights & Biases

Eval Metrics
- Metrics on the validation runs during training (predictions using the validation dataset)
- Number of tokens, token accuracy, eval loss (a.k.a. validation loss)
What to look for:
- Validation loss should decrease quickly and plateau
- Validation loss should be very close to training loss
- Validation loss should not start to increase (indicates overfitting)

Weights & Biases

Demo

W&B output during current training run

Let’s Do This!

Create your W&B account (https://wandb.ai), download API key, set in Colab (WANDB_API_KEY)

Open train-cuda.ipynb in Colab, configure the training parameters in the “Model parameters” cell

Change runtype to A100 and start your training run!

Note: Run one cell at a time, not all!

Post-Training

After training has completed:
- Test the model using a random input from our test dataset
- Create a model card, merge the models, and upload to Hugging Face
- Quantize the model and upload the GGUF to Hugging Face
- Download the GGUF in LM Studio!

Post-Training: Test

import random

idx = random.randint(0, len(formatted_dataset['test']) -1)
test_data = formatted_dataset['test'][idx]["text"]

output = pipe(test_data)
print(output[0]['generated_text'])

Post-Training: Create Model Card

What’s a model card?
- README.md for your model on Hugging Face
- Common sections
  - An overview of the model
  - How the model was trained
  - Benchmarks
  - How to use the model
  - Limitations, best practices, risks, safety, author details

Post-Training: Create Model Card

A default README.md will be generated by save_pretrained() method
You can overwrite this with your own
Or create one programmatically

Post-Training: Create Model Card

from huggingface_hub import ModelCard

card_content = f"""---
base_model: {BASE_MODEL}
tags:
- peft
- lora
- text-generation
---

# {MODEL_NAME}

## Model Description
Fine-tuned from `{BASE_MODEL}` using QLoRA (4-bit) with supervised fine-tuning.

## Training Details
- Dataset: `{DATASET_REPO}`
- LoRA rank: {LORA_R}, alpha: {LORA_ALPHA}
- Epochs: {NUM_EPOCHS}, Learning rate: {LEARNING_RATE}

## Intended Use

This model is a test model used for the CS-394/594 class at DigiPen.

The model is designed to provide a summary explanation of a snippet of Python code, to be used in an IDE. This model takes a snippet of code (passed as the user prompt) and returns a two paragraph explanation of what the code does, including an analogy that helps students better understand how the code functions.

## Limitations

This model is a single-turn model and has not been trained on support long, multi-turn conversations.
"""

card = ModelCard(card_content)
card.save(f"{MODEL_FOLDER}/lora/README.md")

Post-Training: Merge and Upload

Merging “fuses” the LoRA adapter and base model, creating a new model
Doing this will enable others to use your model
Upload your merged model (and optionally your adapter in a sub-folder) to Hugging Face

Post-Training: Merge and Upload

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from peft import PeftModel

# Load the configuration and model
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    device_map="auto",
    dtype=torch.bfloat16,
)
adapter_model = PeftModel.from_pretrained(base_model, f"{MODEL_FOLDER}/lora")

# Merge and save the model
merged_model = adapter_model.merge_and_unload() # type: ignore
merged_model.save_pretrained(f"{MODEL_FOLDER}/merged")
# Save the tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained(f"{MODEL_FOLDER}/merged")

# Copy the model card to the merged folder
!cp {MODEL_FOLDER}/lora/README.md {MODEL_FOLDER}/merged/.

Post-Training: Merge and Upload

from huggingface_hub import HfApi, create_repo

# Set the MODEL_REPO
MODEL_REPO = f"{HF_USERNAME}/{MODEL_NAME}"

# Initialize the API
api = HfApi()

# Create the repository (if it doesn't exist)
try:
    create_repo(MODEL_REPO, repo_type="model", exist_ok=True, private=False)
    print(f"Repository {MODEL_REPO} created or already exists")
except Exception as e:
    print(f"Error creating repo: {e}")

# Upload merged model files to root (this is the main model)
print("\nUploading merged model to root...")
api.upload_folder(
    folder_path=f"{MODEL_FOLDER}/merged",
    repo_id=MODEL_REPO,
    repo_type="model",
    path_in_repo="",  # Empty string uploads to root
    commit_message="Upload merged model"
)

# Upload LoRA adapter to subfolder (optional but useful for reference)
print("\nUploading LoRA adapter...")
api.upload_folder(
    folder_path=f"{MODEL_FOLDER}/lora",
    repo_id=MODEL_REPO,
    repo_type="model",
    path_in_repo="lora_adapter",  # Keep adapter in subfolder
    commit_message="Upload LoRA adapter"
)

print(f"\n✓ All files uploaded successfully to https://huggingface.co/{MODEL_REPO}")

Demo

Create model card, merge, and upload to Hugging Face

Post-Training: Quantize

Even though we used QLoRA, your merged model is still full precision
To enable others to run locally, we want to create quantized versions (GGUF)
Use llama-quantize (part of llama.cpp) to quantize
Then upload to Hugging Face (in the root of the model repo so it’s discoverable)

Post-Training: Quantize

TEMP_FOLDER = "/content/tmp"

# Create a temporary location and clone llama.cpp
!mkdir -p {TEMP_FOLDER}
!cd {TEMP_FOLDER} && git clone https://github.com/ggml-org/llama.cpp

!cd {TEMP_FOLDER}/llama.cpp && cmake -B build
!cd {TEMP_FOLDER}/llama.cpp && cmake --build build --config Release --target llama-quantize -j 8

Post-Training: Quantize

from huggingface_hub import snapshot_download

HF_USERNAME = "simonguest" # Change this to your Hugging Face username
MODEL_NAME = "Qwen3-1.7B-code-explainer" # Change this to the repo of your merged model
MODEL_REPO = f"{HF_USERNAME}/{MODEL_NAME}"

# Working folders
MODEL_FOLDER = "/content/model"
GGUF_FOLDER = "/content/ggufs"

!mkdir -p {MODEL_FOLDER}
!mkdir -p {GGUF_FOLDER}

# Download the model from Hugging Face at full precision
snapshot_download(repo_id=MODEL_REPO, local_dir=f"{MODEL_FOLDER}", repo_type="model")

# Convert to GGUF (fp16 first)
!cd {TEMP_FOLDER}/llama.cpp && python convert_hf_to_gguf.py {MODEL_FOLDER} \
    --outfile {GGUF_FOLDER}/{MODEL_NAME}-F16.gguf \
    --outtype f16

# # Convert gp16 GGUF to Q4_K_M
!{TEMP_FOLDER}/llama.cpp/build/bin/llama-quantize {GGUF_FOLDER}/{MODEL_NAME}-F16.gguf \
    {GGUF_FOLDER}/{MODEL_NAME}-Q4_K_M.gguf \
    Q4_K_M

Post-Training: Quantize

from huggingface_hub import HfApi

# Initialize the API
api = HfApi()

# Upload GGUF files to root
print("\nUploading GGUF models to root...")
api.upload_folder(
    folder_path=f"{GGUF_FOLDER}",
    repo_id=MODEL_REPO,
    repo_type="model",
    path_in_repo="",  # GGUF files are typically found on the root level of a HF repo
    commit_message="Upload GGUF quantized models"
)

print(f"\nGGUFs uploaded successfully to https://huggingface.co/{MODEL_REPO}")

Demo

Quantize and upload to Hugging Face

Post-Training: Use in LM Studio!

Use LM Studio, search for your Hugging Face username
Download and test the GGUF of your newly fine-tuned model
(Don’t forget to turn off your A100 Colab instance, if you haven’t already!)

Post-Training: Use in LM Studio!

Post-Training: Use in LM Studio!

Demo

Search and chat in LM Studio!

Looking Ahead

This week’s assignment!
Our last lecture before spring break and final project!
Discuss ethical, IP, and safety concerns for Generative AI
Use an evidence-based approach for each of the areas

References

Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. “QLoRA: Efficient Finetuning of Quantized LLMs.” Advances in Neural Information Processing Systems 36. https://arxiv.org/abs/2305.14314.

Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv Preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685.

Module 7: Increasing Model Accuracy (Part 2)

Recap

Lesson Objectives

Model Training Stages

Model Training Stages

Model Training Stages

Model Training Stages

Model Training Stages

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT)

Introducing LoRA

Introducing LoRA

Introducing LoRA

Introducing QLoRA

Introducing QLoRA

Our Fine-Tuning Plan

Our Fine-Tuning Plan (1/2)

Our Fine-Tuning Plan (2/2)

Upload Training Files

Upload Training Files

Upload Training Files

Upload Training Files

Demo

Upload Training Files

Hands-on

The Training Notebook

The Training Notebook

Demo

Sidebar: How long will training take?

Training Hyperparameters

Training Hyperparameters

Training Hyperparameters

Training Hyperparameters

Training Hyperparameters

Training Hyperparameters

Training Hyperparameters

Training Hyperparameters

Training Hyperparameters

Training Hyperparameters

Demo

Weights & Biases

Weights & Biases

Weights & Biases

Weights & Biases

Weights & Biases

Weights & Biases

Weights & Biases

Weights & Biases

Demo

Let’s Do This!

Post-Training

Post-Training

Post-Training: Test

Post-Training: Create Model Card

Post-Training: Create Model Card

Post-Training: Create Model Card

Post-Training: Merge and Upload

Post-Training: Merge and Upload

Post-Training: Merge and Upload

Demo

Post-Training: Quantize

Post-Training: Quantize

Post-Training: Quantize

Post-Training: Quantize

Demo

Post-Training: Use in LM Studio!

Post-Training: Use in LM Studio!

Post-Training: Use in LM Studio!

Demo

Looking Ahead

Looking Ahead

References

References