# QMD Query Expansion Fine-Tuning

## Overview

Train Qwen3-1.7B to expand search queries into structured `hyde:/lex:/vec:` output for QMD's hybrid retrieval pipeline.

## Output Format

```
hyde: A hypothetical document passage that would answer the query.
lex: keyword1
lex: keyword2
vec: semantic query reformulation
vec: another semantic variation
```

- `hyde:` always comes first and is at most one line
- `lex:` lines feed BM25 keyword search (1-3 lines, short keywords)
- `vec:` lines feed vector similarity search (1-3 lines, natural language)
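
The rules above can be checked mechanically. The following is a minimal sketch, not the project's scorer (`reward.py` remains the source of truth); the function names `parse_expansion` and `check_format` are hypothetical, and it assumes the default format rather than the `/only:` variants.

```python
PREFIXES = ("hyde", "lex", "vec")


def parse_expansion(text: str) -> list[tuple[str, str]]:
    """Split model output into ordered (prefix, value) pairs, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        prefix, sep, value = line.partition(":")
        prefix = prefix.strip()
        if not sep or prefix not in PREFIXES:
            raise ValueError(f"unexpected line: {line!r}")
        pairs.append((prefix, value.strip()))
    return pairs


def check_format(text: str) -> list[str]:
    """Return a list of rule violations; an empty list means well-formed output."""
    try:
        pairs = parse_expansion(text)
    except ValueError as exc:
        return [str(exc)]
    prefixes = [prefix for prefix, _ in pairs]
    problems = []
    if prefixes.count("hyde") > 1:
        problems.append("more than one hyde: line")
    if "hyde" in prefixes and prefixes[0] != "hyde":
        problems.append("hyde: must be the first line")
    for prefix in ("lex", "vec"):
        count = prefixes.count(prefix)
        if not 1 <= count <= 3:
            problems.append(f"expected 1-3 {prefix}: lines, got {count}")
    return problems
```

Returning a list of problems rather than a boolean makes it easier to see which rule a generated expansion broke.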

## Model Repository

**Single destination**: `tobil/qmd-query-expansion-1.7B`

- No versioned directories (`-v1`, `-v2`, `-v4`, etc.)
- No separate `-sft` or `-grpo` repos for final models
- Update the main repo only when eval scores improve
- GGUF variants go to `tobil/qmd-query-expansion-1.7B-gguf`

## Training Data

All JSONL files in `data/` are training data:

```
data/
├── qmd_expansion_v2.jsonl
├── qmd_expansion_handcrafted_only.jsonl
├── qmd_only_sampled.jsonl
├── qmd_only_variants.jsonl
└── ... any additional .jsonl files
```

**All `.jsonl` files in `data/` should be concatenated for training runs.**

Each line is one JSON object with `input` and `output` fields: `{"input": "query", "output": "hyde:...\nlex:...\nvec:..."}`
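
As an illustration of the concatenation rule, the sketch below reads every `.jsonl` file in a directory into one list and checks that each record has both fields. The helper name `load_training_examples` is hypothetical; how `train.py` actually loads data is not shown here.

```python
import json
from pathlib import Path


def load_training_examples(data_dir: str = "data") -> list[dict]:
    """Concatenate all *.jsonl files in data_dir into one list of records."""
    examples = []
    # Sort for a deterministic order across runs.
    for path in sorted(Path(data_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line_no, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                missing = {"input", "output"} - record.keys()
                if missing:
                    raise ValueError(
                        f"{path.name}:{line_no} missing fields {sorted(missing)}"
                    )
                examples.append(record)
    return examples
```

Failing on a missing field, with the file name and line number in the message, surfaces a malformed record before a training run starts.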

## Data Generation Tools

| Script | Purpose |
|--------|---------|
| `dataset/generate_data.py` | Generate via Claude API (high quality) |
| `dataset/generate_data_offline.py` | Transform from HuggingFace datasets |
| `dataset/prepare_data.py` | Format for Qwen3 chat template |
| `dataset/clean_data.py` | Detect and fix technical term issues |
| `generate_only_variants.py` | Generate `/only:lex` and `/only:vec` variants |

## Local Training Output

All training outputs go to `outputs/` (gitignored):

```
outputs/
├── sft/           # SFT checkpoint
└── grpo/          # GRPO checkpoint
```

## Training Pipeline

Always use **Qwen3-1.7B** as the base model unless explicitly stated otherwise.

Training can run **locally** (requires CUDA GPU) or via **HuggingFace Jobs** (cloud GPU, no local hardware needed).

### Stage 1: SFT

```bash
# Local (requires CUDA)
uv run train.py sft --config configs/sft.yaml
# Output: outputs/sft/

# Cloud (HuggingFace Jobs - no local GPU needed)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py
```

### Stage 2: GRPO

```bash
# Local (requires CUDA)
uv run train.py grpo --config configs/grpo.yaml
# Output: outputs/grpo/

# Cloud (HuggingFace Jobs - no local GPU needed)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 4h jobs/grpo.py
```

### HuggingFace Jobs

If no local CUDA device is available, run training in the cloud with the `hf jobs uv run` commands shown under each stage. Monitor and manage submitted jobs with:

```bash
hf jobs ps                    # List running jobs
hf jobs logs <job-id>         # Stream logs
hf jobs inspect <job-id>      # Check status
hf jobs cancel <job-id>       # Cancel a job
```

The `jobs/` directory contains self-contained scripts that include all dependencies inline.
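
One common way to make a uv script self-contained is an inline metadata header (PEP 723), which `uv run` reads to install dependencies before executing the file. Whether the scripts in `jobs/` use exactly this layout is an assumption; the dependency names and the `resolve_token` helper below are hypothetical and for illustration only.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "trl",        # assumed dependency, for illustration only
#     "datasets",   # assumed dependency, for illustration only
# ]
# ///
import os


def resolve_token(env: dict) -> str:
    """Return the HF token from an environment mapping, or fail early."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; pass it with --secrets HF_TOKEN")
    return token


if __name__ == "__main__":
    # Check credentials up front so a misconfigured job fails before training.
    resolve_token(dict(os.environ))
    print("HF_TOKEN found; training would start here.")
```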

### Evaluation

```bash
# Eval local model
uv run eval.py --model ./outputs/grpo

# Eval HuggingFace model
uv run eval.py --model tobil/qmd-query-expansion-1.7B

# Save eval results to file
uv run eval.py --model ./outputs/grpo -o eval_results.json
```

## Quality Scoring

`reward.py` is the single source of truth for scoring:

```bash
# Self-test the reward function
uv run reward.py
```

See `SCORING.md` for the full rubric.

## Deployment Rules

**Never upload without eval.** Every model push must include eval results.

### Checklist

1. Train SFT on all `data/*.jsonl` → `outputs/sft/`
2. Train GRPO on top of SFT → `outputs/grpo/`
3. **Run eval on local model**: `uv run eval.py --model ./outputs/grpo -o eval_results.json`
4. Compare against current deployed model's eval
5. If eval improves:
   - Push to `tobil/qmd-query-expansion-1.7B`
   - **Include eval output in the model card / commit message**
6. Convert to GGUF and update `tobil/qmd-query-expansion-1.7B-gguf`
7. Update `DEFAULT_GENERATE_MODEL` in `src/llm.ts` if the repo name changed
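
Steps 4 and 5 amount to a comparison between two saved eval files. The sketch below shows one way to gate a push on that comparison. The layout of `eval_results.json` is not documented here, so the `average_score` key is a hypothetical placeholder, as are the function names; adjust them to whatever `eval.py` actually writes.

```python
import json
from pathlib import Path

SCORE_KEY = "average_score"  # hypothetical field name, not confirmed by eval.py


def load_score(path: str) -> float:
    """Read the summary score from a saved eval results file."""
    results = json.loads(Path(path).read_text(encoding="utf-8"))
    return float(results[SCORE_KEY])


def should_deploy(candidate: float, deployed: float, min_gain: float = 0.0) -> bool:
    """Deploy only if the candidate beats the deployed model by more than min_gain."""
    return candidate - deployed > min_gain
```

A tie does not count as an improvement, which matches the rule that the main repo is updated only when eval scores improve.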

## Key Files

```
finetune/
├── reward.py          # Scoring function (single source of truth)
├── train.py           # Unified SFT + GRPO training
├── eval.py            # Generate and score expansions
├── convert_gguf.py    # GGUF conversion
├── SCORING.md         # Detailed scoring rubric
├── CLAUDE.md          # This file
├── data/              # All training JSONL files
├── outputs/           # Local training outputs (gitignored)
├── dataset/           # Data generation scripts
├── jobs/              # Self-contained HuggingFace Jobs scripts
├── configs/           # Training configs (sft.yaml, grpo.yaml)
└── evals/             # Test queries and results
```