The training data has been rebalanced to reduce its heavy tech focus while keeping enough technical coverage for QMD's use case. The new distribution spreads examples across everyday-life topics and holds Technology at 15%.
Previous distribution:
Technical:       ~50%  ████████████████████████████████████████
How-to:          ~45%  █████████████████████████████████████
What-is:         ~40%  █████████████████████████████████
Other:           ~15%  ████████████
Short queries:   10%   ████████
Temporal:        1.6%  █
Named entities:  3.4%  ██
Category                    Percentage
────────────────────────────────────────
Health & Wellness           12%  █████████
Finance & Business          12%  █████████
Technology                  15%  ███████████
Home & Garden               10%  ████████
Food & Cooking              10%  ████████
Travel & Geography          10%  ████████
Hobbies & Crafts            10%  ████████
Education & Learning         8%  ██████
Arts & Culture               8%  ██████
Lifestyle & Relationships    5%  ████
────────────────────────────────────────
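As a rough illustration of how the category weights in the table could drive example generation, the sketch below draws categories in proportion to those percentages. It is not the code in dataset/generate_data.py, which is not shown here; the names `CATEGORY_WEIGHTS` and `sample_categories` are hypothetical and exist only for this example.

```python
import random
from collections import Counter

# Weights copied from the category table above (percentages, summing to 100).
# The dict name and the function below are hypothetical, for illustration only.
CATEGORY_WEIGHTS = {
    "Health & Wellness": 12,
    "Finance & Business": 12,
    "Technology": 15,
    "Home & Garden": 10,
    "Food & Cooking": 10,
    "Travel & Geography": 10,
    "Hobbies & Crafts": 10,
    "Education & Learning": 8,
    "Arts & Culture": 8,
    "Lifestyle & Relationships": 5,
}


def sample_categories(n, seed=None):
    """Draw n category names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(CATEGORY_WEIGHTS)
    weights = list(CATEGORY_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    # Print the observed share of each category in a large seeded sample.
    counts = Counter(sample_categories(10_000, seed=0))
    for name, count in counts.most_common():
        print(f"{name:<28}{count / 100:5.1f}%")
```

Sampling with weights keeps the long-run share of each category close to its target without fixing the exact count per batch; a generator that needs exact counts would instead allocate a quota per category.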
Short queries (1-2 words): 20%
Temporal (2025/2026): 15%
Named entities: 10%+
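The short-query and temporal targets above can be checked against any list of query strings. The following is a minimal sketch under stated assumptions: it treats a query as "short" when it has one or two whitespace-separated words and as "temporal" when it mentions 2025 or 2026 as a standalone token. The function name `query_shares` and the sample queries are made up for illustration, and the named-entity share is left out because it cannot be measured reliably with a simple pattern.

```python
import re

# Matches the years named in the targets above as standalone tokens.
YEAR_PATTERN = re.compile(r"\b(2025|2026)\b")


def query_shares(queries):
    """Return the fraction of short (1-2 word) and temporal queries."""
    total = len(queries)
    if total == 0:
        return {"short": 0.0, "temporal": 0.0}
    short = sum(1 for q in queries if 1 <= len(q.split()) <= 2)
    temporal = sum(1 for q in queries if YEAR_PATTERN.search(q))
    return {"short": short / total, "temporal": temporal / total}


if __name__ == "__main__":
    # Made-up sample queries, not taken from the dataset.
    sample = [
        "sourdough starter",
        "best budget laptops 2026",
        "how to repot a fern",
        "tax deadline 2025",
        "yoga",
    ]
    print(query_shares(sample))
```

Comparing the returned fractions with the 20% and 15% targets gives a quick sanity check before training; dataset/analyze_data.py remains the project's own tool for the full report.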
New Non-Tech Categories Added:
Temporal examples were updated to use current-era years (2025/2026) for recency queries. This ensures the model learns to handle queries from the current time period.
The short keyword list (SHORT_QUERIES in dataset/prepare_data.py) was expanded from 47 to 144+ entries across all categories.
cd finetune
# Add 500 balanced examples
cat data/qmd_expansion_balanced.jsonl >> data/qmd_expansion_v2.jsonl
# Prepare with enhanced short query templates
uv run dataset/prepare_data.py --add-short 2
# Train
uv run train.py sft --config configs/sft.yaml
# Set API key
export ANTHROPIC_API_KEY=your_key
# Generate 300 balanced examples
uv run dataset/generate_data.py --count 300 \
    --output data/qmd_expansion_fresh.jsonl
# Analyze distribution
uv run dataset/analyze_data.py --input data/qmd_expansion_fresh.jsonl
# Prepare for training
uv run dataset/prepare_data.py --input data/qmd_expansion_fresh.jsonl
# Generate 500 life-focused examples (15% tech)
uv run dataset/generate_balanced.py
# Or generate 265 additional diverse examples
uv run dataset/generate_diverse.py
dataset/generate_data.py - Added category weights (15% tech), 2025/2026 dates
dataset/prepare_data.py - Expanded SHORT_QUERIES from 47→144, templates 5→16
dataset/generate_balanced.py - Life-focused generator (500 examples)
dataset/generate_diverse.py - Philosophy/History/Geography/Trivia generator (265 examples)
dataset/analyze_data.py - Dataset analysis and quality reporting
DATA_IMPROVEMENTS.md - Detailed improvement documentation
data/qmd_expansion_balanced.jsonl - 500 balanced examples
data/qmd_expansion_diverse_addon.jsonl - 265 diverse examples
evals/queries.txt
Generated: 2026-01-30