BALANCED_DISTRIBUTION.md 5.5 KB

QMD Training Data - Balanced Distribution Summary

Overview

The training data has been rebalanced to reduce excessive tech focus while maintaining adequate technical coverage for QMD's use case. The new distribution emphasizes diverse life topics while keeping tech at a reasonable 15%.

Distribution Comparison

Before (Original Data)

Technical:        ~50% ████████████████████████████████████████
How-to:           ~45% █████████████████████████████████████
What-is:          ~40% █████████████████████████████████
Other:            ~15% ████████████
Short queries:    10%  ████████
Temporal:         1.6% █
Named entities:   3.4% ██

After (Balanced Approach)

Category                    Percentage
────────────────────────────────────────
Health & Wellness           12%  █████████
Finance & Business          12%  █████████
Technology                  15%  ███████████
Home & Garden               10%  ████████
Food & Cooking              10%  ████████
Travel & Geography          10%  ████████
Hobbies & Crafts            10%  ████████
Education & Learning         8%  ██████
Arts & Culture               8%  ██████
Lifestyle & Relationships    5%  ████
────────────────────────────────────────
Short queries (1-2 words):  20%
Temporal (2025/2026):       15%
Named entities:            10%+

Key Improvements

1. Category Diversity

New Non-Tech Categories Added:

  • Health & Wellness: Meditation, fitness, nutrition, mental health
  • Finance & Business: Budgeting, investing, career, entrepreneurship
  • Home & Garden: DIY, repairs, cleaning, gardening, organization
  • Food & Cooking: Recipes, techniques, meal planning, nutrition
  • Travel & Geography: Travel planning, destinations, geography facts
  • Hobbies & Crafts: Photography, art, music, woodworking, knitting
  • Education & Learning: Study techniques, languages, online courses
  • Arts & Culture: Art history, music, film, theater, literature
  • Lifestyle & Relationships: Habits, relationships, parenting, minimalism

2. Temporal Queries (2025/2026)

Updated to use current era years for recency queries:

  • "latest research 2026"
  • "Shopify updates 2025"
  • "what changed in React 2026"
  • "AI developments 2025"

This ensures the model learns to handle queries from the current time period.

3. Short Query Coverage

Expanded from 47 to 144+ short keywords across all categories:

  • Tech: auth, config, api, cache, deploy
  • Health: meditate, hydrate, stretch, exercise
  • Finance: budget, save, invest, taxes
  • Home: clean, organize, repair, garden
  • Food: cook, bake, recipe, meal
  • Travel: travel, pack, passport, hotel
  • Hobbies: photo, draw, paint, knit, guitar
  • Education: study, learn, course, exam
  • Arts: art, music, film, dance
  • Life: habit, routine, organize, parent

Usage

Quick Start - Use Balanced Data

cd finetune

# Add 500 balanced examples
cat data/qmd_expansion_balanced.jsonl >> data/qmd_expansion_v2.jsonl

# Prepare with enhanced short query templates  
uv run dataset/prepare_data.py --add-short 2

# Train
uv run train.py sft --config configs/sft.yaml

Generate Fresh Data with Claude API

# Set API key
export ANTHROPIC_API_KEY=your_key

# Generate 300 balanced examples
uv run dataset/generate_data.py --count 300 \
  --output data/qmd_expansion_fresh.jsonl

# Analyze distribution
uv run dataset/analyze_data.py --input data/qmd_expansion_fresh.jsonl

# Prepare for training
uv run dataset/prepare_data.py --input data/qmd_expansion_fresh.jsonl

Generate Even More Balanced Examples

# Generate 500 life-focused examples (15% tech)
uv run dataset/generate_balanced.py

# Or generate 265 additional diverse examples
uv run dataset/generate_diverse.py

File Summary

Modified Files:

  • dataset/generate_data.py - Added category weights (15% tech), 2025/2026 dates
  • dataset/prepare_data.py - Expanded SHORT_QUERIES from 47→144, templates 5→16

New Files:

  • dataset/generate_balanced.py - Life-focused generator (500 examples)
  • dataset/generate_diverse.py - Philosophy/History/Geography/Trivia generator (265 examples)
  • dataset/analyze_data.py - Dataset analysis and quality reporting
  • DATA_IMPROVEMENTS.md - Detailed improvement documentation

Generated Data:

  • data/qmd_expansion_balanced.jsonl - 500 balanced examples
  • data/qmd_expansion_diverse_addon.jsonl - 265 diverse examples

Expected Benefits

  1. Better Short Query Handling: 20% coverage vs 10% before
  2. Named Entity Preservation: 10%+ coverage vs 3.4% before
  3. Temporal Understanding: 15% with 2025/2026 vs 1.6% before
  4. Domain Diversity: 10 categories vs tech-only before
  5. Life-Document Search: Better at searching personal notes on health, finance, hobbies

Next Steps

  1. Merge balanced examples into training set
  2. Retrain model with improved distribution
  3. Evaluate using evals/queries.txt
  4. Monitor scores on temporal/named-entity/short queries
  5. Iterate based on results

Generated: 2026-01-30