Module 6 Assignment: Generate a Fine-tuning Dataset

Objective: Generate a synthetic training dataset that you will use to fine-tune a model next week in Module 7.

Background:

Fine-tuning requires high-quality training data in the form of conversation pairs (user/assistant messages) stored in JSON Lines format. Using the generate-synthetic.ipynb notebook as a starting point, you will design and generate your own dataset for a use case that interests you.
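To make the target format concrete: each line of a JSON Lines file is one standalone JSON object, and for chat fine-tuning a common shape (the OpenAI-style convention) is a `messages` list of user/assistant turns. A minimal sketch of writing one training example — the field names follow that common convention, so adjust them if your fine-tuning API expects something different:

```python
import json

# One training example: a single user/assistant conversation pair.
# The "messages" structure follows the common OpenAI-style convention;
# adapt the field names to whatever your fine-tuning API expects.
example = {
    "messages": [
        {"role": "user", "content": "Summarize: The quarterly report shows 12% growth."},
        {"role": "assistant", "content": "Revenue grew 12% this quarter."},
    ]
}

# JSONL = one JSON object per line: no enclosing array, no trailing commas.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```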

Your dataset should teach a model a specific style, structure, or behavior: something that prompt engineering alone would struggle to achieve consistently at scale.

Requirements

  1. Pick a use case that has a clear style, structure, or behavior you want the model to learn
  2. Define diversity dimensions for your dataset. These should include at least three dimensions (e.g., topics, difficulty levels, response lengths, formats). Refer to the slides for examples of diversity dimensions.
  3. Generate three datasets using synthetic data generation:
    • train.jsonl (at least 500 examples; 5,000+ recommended for more complex interactions)
    • validation.jsonl (at least 100 examples; roughly 10% of your training set size is recommended)
    • test.jsonl (at least 10 examples)
  4. Review a sample of your data for quality: spot-check examples and fix or regenerate any that are low quality.
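As an illustration of how diversity dimensions shape a dataset: enumerating (or sampling) combinations of your dimensions gives each generation call a distinct "recipe", so your examples don't cluster around one topic or style. A hypothetical sketch with three dimensions — the dimension names and values here are made up for illustration, so substitute your own:

```python
import itertools
import random

# Three hypothetical diversity dimensions; replace with your own.
topics = ["cooking", "travel", "finance"]
difficulties = ["beginner", "intermediate", "advanced"]
lengths = ["one sentence", "a short paragraph", "a detailed answer"]

# Every combination is one "recipe" for a generation prompt (3 x 3 x 3 = 27).
recipes = list(itertools.product(topics, difficulties, lengths))

# Sample a recipe and bake it into the prompt sent to the generation model.
random.seed(0)
topic, difficulty, length = random.choice(recipes)
prompt = (
    f"Write a {difficulty}-level question about {topic} "
    f"and answer it in {length}."
)
```

Cycling through all recipes (rather than sampling) guarantees even coverage across every combination of dimensions.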

Deliverable

A GitHub repository (can be the same repo you’ve been using for assignments) containing:

  • A Colab/Jupyter notebook showing your data generation code
    • The notebook must read API keys (e.g., OPENROUTER_API_KEY) from environment variables. Please do not include your API key in your notebook!
  • The generated dataset files: train.jsonl, validation.jsonl, and test.jsonl
  • Markdown cells in your notebook explaining:
    • What use case you chose and why
    • Your diversity dimensions and how they shape the dataset
    • Observations from your quality review (what looked good, what needed fixing)

Hints

  • Start by copying the generate-synthetic.ipynb notebook and modifying it for your use case
  • You can run this notebook locally or on Colab. If you use Colab, don’t forget to download your .jsonl files before you terminate your instance (otherwise you’ll lose your data!)
  • Pick a large, capable model that will generate quality data. I’ve used nvidia/nemotron-3-nano-30b-a3b:free in the demo workbook, but you may want to switch to one of the GPT models.
  • Be careful which model you select to generate the data: thousands of calls will run up charges on paid models. I’ve seen costs of around $5 for 5,000 examples using openai/gpt-5.2-chat
  • Smaller datasets with high quality are better than large datasets with noise. Start with 10-50 training examples and review them before scaling up
  • Think carefully about your prompt. The quality of your synthetic data depends almost entirely on how well you describe the task to the generation model
  • Use structured outputs (Pydantic models) to keep your generated data consistent, as demonstrated in the notebook
  • You will use these datasets to fine-tune a model in Module 7, so pick a use case you’re genuinely interested in!
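Before moving on to Module 7, it's worth running a quick mechanical check over your three .jsonl files. The sketch below is a stdlib stand-in for the Pydantic validation demonstrated in the notebook: it only verifies that each line parses and has the basic message structure, which is an assumption about your chosen format, not a quality check — you still need to read samples by hand.

```python
import json

def check_jsonl(path):
    """Lightweight sanity check for a chat-format JSONL file.

    A stdlib stand-in for Pydantic-style validation: confirms each line is
    valid JSON with a non-empty "messages" list of known roles. Returns a
    list of (line_number, problem) tuples; an empty list means the file is
    structurally clean (but says nothing about content quality).
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                problems.append((i, "not valid JSON"))
                continue
            msgs = obj.get("messages")
            if not isinstance(msgs, list) or not msgs:
                problems.append((i, "missing 'messages' list"))
            elif any(m.get("role") not in {"system", "user", "assistant"} for m in msgs):
                problems.append((i, "unexpected role"))
    return problems
```

Run it on train.jsonl, validation.jsonl, and test.jsonl before you download or commit them; a structural failure here would otherwise surface as a confusing error during fine-tuning.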