|
|
@@ -0,0 +1,164 @@
|
|
|
+# QMD Query Expansion Fine-Tuning
|
|
|
+
|
|
|
+## Overview
|
|
|
+
|
|
|
+Train Qwen3-1.7B to expand search queries into structured `hyde:/lex:/vec:` output for QMD's hybrid retrieval pipeline.
|
|
|
+
|
|
|
+## Output Format
|
|
|
+
|
|
|
+```
|
|
|
+hyde: A hypothetical document passage that would answer the query.
|
|
|
+lex: keyword1
|
|
|
+lex: keyword2
|
|
|
+vec: semantic query reformulation
|
|
|
+vec: another semantic variation
|
|
|
+```
|
|
|
+
|
|
|
+- `hyde:` always comes FIRST (one line max)
|
|
|
+- `lex:` lines for BM25 keyword search (1-3 lines, short keywords)
|
|
|
+- `vec:` lines for vector similarity search (1-3 lines, natural language)
|
|
|
+
|
|
|
+## Model Repository
|
|
|
+
|
|
|
+**Single destination**: `tobil/qmd-query-expansion-1.7B`
|
|
|
+
|
|
|
+- No versioned directories (`-v1`, `-v2`, `-v4`, etc.)
|
|
|
+- No separate `-sft` or `-grpo` repos for final models
|
|
|
+- Update the main repo only when eval scores improve
|
|
|
+- GGUF variants go to `tobil/qmd-query-expansion-1.7B-gguf`
|
|
|
+
|
|
|
+## Training Data
|
|
|
+
|
|
|
+All JSONL files in `data/` are training data:
|
|
|
+
|
|
|
+```
|
|
|
+data/
|
|
|
+├── qmd_expansion_v2.jsonl
|
|
|
+├── qmd_expansion_handcrafted_only.jsonl
|
|
|
+├── qmd_only_sampled.jsonl
|
|
|
+├── qmd_only_variants.jsonl
|
|
|
+└── ... any additional .jsonl files
|
|
|
+```
|
|
|
+
|
|
|
+**All `.jsonl` files in `data/` should be concatenated for training runs.**
|
|
|
+
|
|
|
+Each JSONL line: `{"input": "query", "output": "hyde:...\nlex:...\nvec:..."}`
|
|
|
+
|
|
|
+## Data Generation Tools
|
|
|
+
|
|
|
+| Script | Purpose |
|
|
|
+|--------|---------|
|
|
|
+| `dataset/generate_data.py` | Generate via Claude API (high quality) |
|
|
|
+| `dataset/generate_data_offline.py` | Transform from HuggingFace datasets |
|
|
|
+| `dataset/prepare_data.py` | Format for Qwen3 chat template |
|
|
|
+| `dataset/clean_data.py` | Detect and fix technical term issues |
|
|
|
+| `generate_only_variants.py` | Generate `/only:lex` and `/only:vec` variants |
|
|
|
+
|
|
|
+## Local Training Output
|
|
|
+
|
|
|
+All training outputs go to `outputs/` (gitignored):
|
|
|
+
|
|
|
+```
|
|
|
+outputs/
|
|
|
+├── sft/ # SFT checkpoint
|
|
|
+└── grpo/ # GRPO checkpoint
|
|
|
+```
|
|
|
+
|
|
|
+## Training Pipeline
|
|
|
+
|
|
|
+Always use **Qwen3-1.7B** as the base model unless explicitly stated otherwise.
|
|
|
+
|
|
|
+Training can run **locally** (requires CUDA GPU) or via **HuggingFace Jobs** (cloud GPU, no local hardware needed).
|
|
|
+
|
|
|
+### Stage 1: SFT
|
|
|
+
|
|
|
+```bash
|
|
|
+# Local (requires CUDA)
|
|
|
+uv run train.py sft --config configs/sft.yaml
|
|
|
+# Output: outputs/sft/
|
|
|
+
|
|
|
+# Cloud (HuggingFace Jobs - no local GPU needed)
|
|
|
+hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py
|
|
|
+```
|
|
|
+
|
|
|
+### Stage 2: GRPO
|
|
|
+
|
|
|
+```bash
|
|
|
+# Local (requires CUDA)
|
|
|
+uv run train.py grpo --config configs/grpo.yaml
|
|
|
+# Output: outputs/grpo/
|
|
|
+
|
|
|
+# Cloud (HuggingFace Jobs - no local GPU needed)
|
|
|
+hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 4h jobs/grpo.py
|
|
|
+```
|
|
|
+
|
|
|
+### HuggingFace Jobs
|
|
|
+
|
|
|
+If no local CUDA device is available, use `hf jobs` to run training in the cloud:
|
|
|
+
|
|
|
+```bash
|
|
|
+hf jobs ps # List running jobs
|
|
|
+hf jobs logs <job-id> # Stream logs
|
|
|
+hf jobs inspect <job-id> # Check status
|
|
|
+hf jobs cancel <job-id> # Cancel a job
|
|
|
+```
|
|
|
+
|
|
|
+The `jobs/` directory contains self-contained scripts that include all dependencies inline.
|
|
|
+
|
|
|
+### Evaluation
|
|
|
+
|
|
|
+```bash
|
|
|
+# Eval local model
|
|
|
+uv run eval.py --model ./outputs/grpo
|
|
|
+
|
|
|
+# Eval HuggingFace model
|
|
|
+uv run eval.py --model tobil/qmd-query-expansion-1.7B
|
|
|
+
|
|
|
+# Save eval results to file
|
|
|
+uv run eval.py --model ./outputs/grpo -o eval_results.json
|
|
|
+```
|
|
|
+
|
|
|
+## Quality Scoring
|
|
|
+
|
|
|
+`reward.py` is the single source of truth for scoring:
|
|
|
+
|
|
|
+```bash
|
|
|
+# Self-test the reward function
|
|
|
+uv run reward.py
|
|
|
+```
|
|
|
+
|
|
|
+See `SCORING.md` for the full rubric.
|
|
|
+
|
|
|
+## Deployment Rules
|
|
|
+
|
|
|
+**Never upload without eval.** Every model push must include eval results.
|
|
|
+
|
|
|
+### Checklist
|
|
|
+
|
|
|
+1. Train SFT on all `data/*.jsonl` → `outputs/sft/`
|
|
|
+2. Train GRPO on top of SFT → `outputs/grpo/`
|
|
|
+3. **Run eval on local model**: `uv run eval.py --model ./outputs/grpo -o eval_results.json`
|
|
|
+4. Compare against current deployed model's eval
|
|
|
+5. If eval improves:
|
|
|
+ - Push to `tobil/qmd-query-expansion-1.7B`
|
|
|
+ - **Include eval output in the model card / commit message**
|
|
|
+6. Convert to GGUF and update `tobil/qmd-query-expansion-1.7B-gguf`
|
|
|
+7. Update `src/llm.ts` DEFAULT_GENERATE_MODEL if repo name changed
|
|
|
+
|
|
|
+## Key Files
|
|
|
+
|
|
|
+```
|
|
|
+finetune/
|
|
|
+├── reward.py # Scoring function (single source of truth)
|
|
|
+├── train.py # Unified SFT + GRPO training
|
|
|
+├── eval.py # Generate and score expansions
|
|
|
+├── convert_gguf.py # GGUF conversion
|
|
|
+├── SCORING.md # Detailed scoring rubric
|
|
|
+├── CLAUDE.md # This file
|
|
|
+├── data/ # All training JSONL files
|
|
|
+├── outputs/ # Local training outputs (gitignored)
|
|
|
+├── dataset/ # Data generation scripts
|
|
|
+├── jobs/ # Self-contained HuggingFace Jobs scripts
|
|
|
+├── configs/ # Training configs (sft.yaml, grpo.yaml)
|
|
|
+└── evals/ # Test queries and results
|
|
|
+```
|