--- license: mit language: - en base_model: Qwen/Qwen3-1.7B tags: - query-expansion - search - gguf - qwen3 pipeline_tag: text-generation --- # QMD Query Expansion Fine-Tuning Train small language models to expand search queries for [QMD](https://github.com/tobi/qmd)'s hybrid retrieval pipeline. ## What This Does Given a raw search query like `"auth config"`, the trained model produces structured expansions: ``` hyde: Authentication can be configured by setting the AUTH_SECRET environment variable. lex: authentication configuration lex: auth settings setup vec: how to configure authentication settings vec: authentication configuration options ``` These feed into QMD's three search backends: - **`lex:`** lines go to BM25 full-text search (short, keyword-focused) - **`vec:`** lines go to vector similarity search (natural language phrases) - **`hyde:`** is a hypothetical document passage for embedding-based retrieval ([HyDE](https://arxiv.org/abs/2212.10496) technique) ## Quick Start ### Cloud training via HuggingFace Jobs (no GPU needed) ```bash # 1. SFT: teach the model the output format (~45 min on A10G, ~$1.50) hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py # 2. Evaluate against test queries (needs local GPU or use eval job) uv run eval.py tobil/qmd-query-expansion-1.7B # 3. Convert to GGUF for local deployment (Ollama, llama.cpp) uv run convert_gguf.py --size 1.7B # NOTE: GRPO is currently experimental and moved to finetune/experiments/grpo # if you want to run it manually, use: # cd finetune && uv run python experiments/grpo/grpo.py ``` ### Local training (if you have a GPU) ```bash uv run train.py sft --config configs/sft.yaml # Experimental GRPO cd finetune && uv run python experiments/grpo/grpo.py ``` ### Monitoring HF Jobs ```bash hf jobs ps # list running jobs hf jobs inspect # check status hf jobs logs # stream logs hf jobs cancel # cancel a job ``` ## Prompt Format All tools use the same prompt — **Qwen3 chat template with `/no_think`**: ``` <|im_start|>user /no_think Expand this search query: {query}<|im_end|> <|im_start|>assistant ``` The `/no_think` directive suppresses Qwen3's chain-of-thought mode, producing direct `lex:/vec:/hyde:` output without `` blocks. ## File Structure ``` finetune/ ├── reward.py # Scoring/reward function (single source of truth) ├── train.py # SFT training entrypoint ├── eval.py # Generate expansions and score them ├── convert_gguf.py # GGUF conversion for Ollama/llama.cpp ├── jobs/ │ ├── sft.py # Self-contained SFT for HuggingFace Jobs │ ├── eval.py # Self-contained eval for HuggingFace Jobs │ └── eval_common.py # Shared eval utilities ├── configs/ │ └── sft.yaml # SFT hyperparameters for Qwen3-1.7B ├── evals/ │ └── queries.txt # 31 test queries across 8 categories ├── experiments/ │ └── grpo/ # Experimental GRPO configuration and script (optional) ├── data/ # Training JSONL files (all concatenated for training) ├── dataset/ │ ├── prepare_data.py # Format for Qwen3 chat template, dedup, split │ ├── schema.py # Parse/normalize output format │ ├── validate_schema.py # Validate JSONL against schema │ ├── score_data.py # Score all examples using reward.py │ └── analyze_data.py # Analyze distribution and quality ├── SCORING.md # Detailed scoring rubric reference └── README.md # This file ``` ## Training Pipeline ### Stage 1: SFT (Supervised Fine-Tuning) Teaches the model the `lex:/vec:/hyde:` output format from labeled examples. | Parameter | Value | |-----------|-------| | Base model | `Qwen/Qwen3-1.7B` | | Method | LoRA (rank 16, alpha 32) | | Target modules | All projection layers (q/k/v/o/gate/up/down) | | Dataset | ~2,290 examples (train split) | | Effective batch size | 16 (4 x 4 gradient accumulation) | | Epochs | 5 | | Learning rate | 2e-4 (cosine schedule) | ```bash uv run train.py sft --config configs/sft.yaml uv run train.py sft --config configs/sft.yaml --dry-run # preview config ``` ### Stage 2: (Experimental) GRPO GRPO is currently treated as experimental and kept under `experiments/grpo/`. It is not part of the default production path for this repository. ```bash # Optional experimental GRPO run cd finetune && uv run python experiments/grpo/grpo.py ``` ## Evaluation `eval.py` generates expansions from a model and scores them against test queries: ```bash # Evaluate a SFT model uv run eval.py --model tobil/qmd-query-expansion-1.7B-sft # Evaluate an SFT output dir uv run eval.py outputs/sft # Verbose output with deduction details uv run eval.py tobil/qmd-query-expansion-1.7B -v # Optional: evaluate GRPO experimental output (if run) uv run eval.py outputs/grpo # Save detailed scores to JSON uv run eval.py tobil/qmd-query-expansion-1.7B -o scores.json ``` ## Reward Function `reward.py` is the single source of truth for scoring. It is used for evaluation and (optionally) as the GRPO reward signal in the experimental path. Five scoring dimensions (max 120 without hyde, 140 with): | Dimension | Points | What It Measures | |-----------|--------|------------------| | **Format** | 0-30 | Has lex/vec lines, no invalid lines | | **Diversity** | 0-30 | Multiple expansion types, diverse content, no query echoes | | **HyDE** | 0-20 | Present, 50-200 chars, single line, not repetitive | | **Quality** | 0-20 | Lex shorter than vec, natural language, preserves key terms | | **Entity** | -45 to +20 | Named entities preserved in lex and vec lines | | **Think bonus** | 0-20 | Reward for NOT using `` mode | **Hard failures** (instant 0.0): - Chat template leakage (`<|im_start|>`, `<|im_end|>`, etc.) - Any line without a valid `lex:`, `vec:`, or `hyde:` prefix ```bash # Self-test the reward function uv run reward.py ``` ## GGUF Conversion Merges base + SFT and (optionally) GRPO adapters into a single model, then produces quantized GGUF files for deployment: ```bash # Use preset for 1.7B uv run convert_gguf.py --size 1.7B # Custom models uv run convert_gguf.py --base Qwen/Qwen3-1.7B \ --sft tobil/qmd-query-expansion-1.7B-sft \ --grpo tobil/qmd-query-expansion-1.7B-grpo \ --output tobil/qmd-query-expansion-1.7B-gguf ``` ### Using with Ollama ```bash huggingface-cli download tobil/qmd-query-expansion-1.7B-gguf \ qmd-query-expansion-1.7B-q4_k_m.gguf --local-dir . echo 'FROM ./qmd-query-expansion-1.7B-q4_k_m.gguf' > Modelfile ollama create qmd-expand -f Modelfile ollama run qmd-expand ``` ## Data Pipeline All JSONL files in `data/` are concatenated for training. To prepare for training: ```bash # Format for Qwen3 chat template, deduplicate, split train/val uv run dataset/prepare_data.py # Validate data quality just validate ``` ## Architecture Notes The production training approach is currently **SFT-only**: 1. **SFT** establishes format compliance and basic query understanding. It uses a large LoRA (rank 16, all projection layers) because it needs to learn a new output format from scratch. 2. **GRPO** exists as an optional experimental path under `experiments/grpo/` and is not in the production training pipeline. The reward function is entirely rule-based (no LLM judge) which makes it fast, deterministic, and suitable as an RL signal. See `SCORING.md` for the full rubric. ## Training Results (Qwen3-1.7B, v2) ### SFT | Metric | Value | |--------|-------| | Final train loss | 0.472 | | Final eval loss | 0.304 | | Token accuracy (train) | 97.4% | | Token accuracy (eval) | 93.8% | | Epochs | 5 | | Hardware | A10G (24 GB VRAM) | ### Evaluation Scores | Model | Average Score | Excellent (30) | |-------|--------------|-----------------| | SFT | 92.0% | 30/30 | > GRPO scores are not tracked in this branch; see `experiments/grpo/` for historical > experimental results.