# QMD Query Expansion Model Finetuning Finetune small Qwen models for QMD's query expansion task. ## Goal Train models that convert user queries into retrieval-optimized outputs: ``` Input: "how to configure authentication" Output: lex: authentication setup lex: auth configuration vec: how to set up user authentication in the application hyde: To configure authentication, set the AUTH_SECRET environment variable and enable the auth middleware in your application config. ``` ## Output Format | Type | Purpose | Count | |------|---------|-------| | `lex` | BM25 keyword variations | 1-3 | | `vec` | Semantic reformulations | 1-3 | | `hyde` | Hypothetical document passage | 0-1 | ## Trained Models | Model | HuggingFace | Format Compliance | Status | |-------|-------------|-------------------|--------| | **Qwen3-0.6B (SFT)** | [tobil/qmd-query-expansion-0.6B](https://huggingface.co/tobil/qmd-query-expansion-0.6B) | **95%** | Recommended | | Qwen3-1.7B v2 (SFT) | [tobil/qmd-query-expansion-1.7B-v2](https://huggingface.co/tobil/qmd-query-expansion-1.7B-v2) | TBD | Completed | | Qwen3-0.6B (GRPO) | [tobil/qmd-query-expansion-0.6B-grpo](https://huggingface.co/tobil/qmd-query-expansion-0.6B-grpo) | 0% | Failed - lost formatting | | Qwen3-1.7B v1 (SFT) | [tobil/qmd-query-expansion-1.7B](https://huggingface.co/tobil/qmd-query-expansion-1.7B) | 0% | Training issues | | Qwen3-0.6B (baseline) | - | 0% | Untrained | **Note:** GRPO (RL) training caused catastrophic forgetting - the model lost all learned formatting. ## Training Dataset - **Dataset**: [tobil/qmd-query-expansion-train](https://huggingface.co/datasets/tobil/qmd-query-expansion-train) - **Source**: Transformed from [s-emanuilov/query-expansion](https://huggingface.co/datasets/s-emanuilov/query-expansion) (CC BY 4.0) - **Size**: 5,157 examples (train: 4,641, eval: 516) - **Format**: Chat messages with user query and assistant response in lex/vec/hyde format ## Directory Structure ``` finetune/ ├── README.md # This file ├── DATASETS.md # Dataset research findings ├── TRAINING_JOBS.md # HuggingFace Jobs tracking ├── generate_data_offline.py # Transform s-emanuilov dataset to QMD format ├── prepare_data.py # Upload to HuggingFace Hub ├── train_0.6B.py # Training script for 0.6B model ├── train_1.7B.py # Training script for 1.7B model ├── train_grpo.py # GRPO RL training (optional) ├── evaluate_model.py # Evaluate finetuned models ├── evaluate_baseline.py # Evaluate base models ├── data/ │ ├── qmd_expansion.jsonl # Generated training data │ └── train/ # Prepared chat format └── evaluation_*.json # Evaluation results ``` ## Quick Start ### 1. Generate Training Data ```bash # Transform s-emanuilov dataset to QMD format (no API needed) uv run generate_data_offline.py ``` ### 2. Prepare and Upload Dataset ```bash # Convert to chat format and upload to HuggingFace Hub uv run prepare_data.py ``` ### 3. Train on HuggingFace Jobs ```bash # Train Qwen3-0.6B (recommended) hf jobs uv run --flavor a10g-large --timeout 3h --secrets HF_TOKEN \ "https://huggingface.co/tobil/qmd-training-scripts/resolve/main/train_0.6B.py" ``` ### 4. Evaluate ```bash # Evaluate finetuned model uv run evaluate_model.py --model tobil/qmd-query-expansion-0.6B --base-model Qwen/Qwen3-0.6B # Compare to baseline uv run evaluate_baseline.py --model Qwen/Qwen3-0.6B --num-queries 10 ``` ### 5. Export to GGUF ```bash # Convert to GGUF for node-llama-cpp (TODO) uv run export_gguf.py --model tobil/qmd-query-expansion-0.6B --quantization Q8_0 ``` ## Training Configuration | Parameter | Value | |-----------|-------| | Method | LoRA (rank 16, alpha 32) | | Learning Rate | 2e-4 | | Epochs | 3 | | Batch Size | 4 (with 4x gradient accumulation) | | Max Seq Length | 512 | | Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | ## Prompt Format The models are trained on this simple prompt format: ``` Expand this search query: {query} ``` The model responds with lex/vec/hyde lines directly. ## Evaluation Results ### 0.6B Finetuned Model (95% format compliance) Sample outputs: | Query | Output | |-------|--------| | `how to configure authentication` | lex: steps for setting up authentication
vec: steps for setting up authentication in cloud services
hyde: The process of configure authentication... | | `kubernetes vs docker swarm` | lex: kubernetes and docker swarm
vec: kubernetes vs docker swarm
hyde: Kubernetes vs docker swarm is an important concept... | | `cors error fix` | lex: how to fix cors
vec: how to fix cors issues in web apps
hyde: The topic of cors error fix guide... | ### Baseline Model (0% format compliance) The untrained model generates random prose, code blocks, or repetitive text with no understanding of the lex/vec/hyde format. ## Future Work - [ ] Export to GGUF for local inference - [ ] Integrate into QMD as default query expansion model - [ ] GRPO training for improved diversity (optional) - [ ] Fix 1.7B training issues