QMD Query Expansion Model Finetuning

Finetune small Qwen models for QMD's query expansion task.

Goal

Train models that convert user queries into retrieval-optimized outputs:

Input: "how to configure authentication"

Output:
lex: authentication setup
lex: auth configuration
vec: how to set up user authentication in the application
hyde: To configure authentication, set the AUTH_SECRET environment variable and enable the auth middleware in your application config.

Output Format

| Type | Purpose | Count |
|------|---------|-------|
| lex: | BM25 keyword variations (short, keyword-focused) | 1-3 |
| vec: | Semantic reformulations (natural language) | 1-3 |
| hyde: | Hypothetical document passage (50-150 chars) | 0-1 |
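
The constraints in the table above can be checked mechanically. The following is a minimal sketch, not the validator used by the training scripts: the function name `validate_expansion` and the choice to treat every non-conforming line (including blank lines) as an error are assumptions made for illustration.

```python
from collections import Counter

# Allowed number of lines per type, taken from the table above.
COUNT_LIMITS = {"lex": (1, 3), "vec": (1, 3), "hyde": (0, 1)}
# Allowed hyde passage length in characters, taken from the table above.
HYDE_LENGTH = (50, 150)


def validate_expansion(text: str) -> list[str]:
    """Return a list of format problems; an empty list means the output is valid."""
    problems = []
    counts = Counter()
    for line in text.strip().splitlines():
        kind, sep, body = line.partition(":")
        body = body.strip()
        # Assumption: any line that is not "<type>: <non-empty text>" is rejected.
        if not sep or kind not in COUNT_LIMITS or not body:
            problems.append(f"bad line: {line!r}")
            continue
        counts[kind] += 1
        if kind == "hyde" and not HYDE_LENGTH[0] <= len(body) <= HYDE_LENGTH[1]:
            problems.append(f"hyde length {len(body)} outside {HYDE_LENGTH}")
    for kind, (low, high) in COUNT_LIMITS.items():
        if not low <= counts[kind] <= high:
            problems.append(f"{kind}: {counts[kind]} lines, expected {low}-{high}")
    return problems
```

Calling `validate_expansion` on the example output from the Goal section returns an empty list, while an output with no `vec:` line yields a count problem for `vec`.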

Trained Models

| Size | SFT Adapter | GRPO Adapter | Base Model |
|------|-------------|--------------|------------|
| 0.6B | tobil/qmd-query-expansion-0.6B-v4 | tobil/qmd-query-expansion-0.6B-v4-grpo | Qwen/Qwen3-0.6B |
| 1.7B | tobil/qmd-query-expansion-1.7B-sft | tobil/qmd-query-expansion-1.7B-grpo | Qwen/Qwen3-1.7B |
| 4B | tobil/qmd-query-expansion-4B-sft | tobil/qmd-query-expansion-4B-grpo | Qwen/Qwen3-4B |

Loading Models

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load SFT model (recommended)
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "tobil/qmd-query-expansion-1.7B-sft")

# Load GRPO model (requires SFT first)
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "tobil/qmd-query-expansion-1.7B-sft")
model = model.merge_and_unload()
model = PeftModel.from_pretrained(model, "tobil/qmd-query-expansion-1.7B-grpo")

Note on GRPO models: GRPO adapters were trained on top of merged SFT weights, so you must load the SFT adapter and merge it into the base model before applying the GRPO adapter.
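
The same loading order applies to every model size. As a rough sketch, the two recipes above could be wrapped in one helper; `load_expansion_model` and the `ADAPTERS` dictionary are hypothetical names introduced here, and the repo IDs are copied from the Trained Models table.

```python
# Repo IDs copied from the Trained Models table.
ADAPTERS = {
    "0.6B": {
        "base": "Qwen/Qwen3-0.6B",
        "sft": "tobil/qmd-query-expansion-0.6B-v4",
        "grpo": "tobil/qmd-query-expansion-0.6B-v4-grpo",
    },
    "1.7B": {
        "base": "Qwen/Qwen3-1.7B",
        "sft": "tobil/qmd-query-expansion-1.7B-sft",
        "grpo": "tobil/qmd-query-expansion-1.7B-grpo",
    },
    "4B": {
        "base": "Qwen/Qwen3-4B",
        "sft": "tobil/qmd-query-expansion-4B-sft",
        "grpo": "tobil/qmd-query-expansion-4B-grpo",
    },
}


def load_expansion_model(size: str = "1.7B", use_grpo: bool = False):
    """Hypothetical helper: load base + SFT, optionally stacking GRPO on merged SFT."""
    # Imported lazily so ADAPTERS can be inspected without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    repos = ADAPTERS[size]
    base = AutoModelForCausalLM.from_pretrained(repos["base"], torch_dtype="bfloat16")
    model = PeftModel.from_pretrained(base, repos["sft"])
    if use_grpo:
        # GRPO adapters expect merged SFT weights underneath them.
        model = model.merge_and_unload()
        model = PeftModel.from_pretrained(model, repos["grpo"])
    return model
```

For example, `load_expansion_model("4B", use_grpo=True)` would download the 4B base model, merge the SFT adapter, and then attach the GRPO adapter.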

Prompt Format

The models use the Qwen3 chat template, with /no_think at the start of the user message to disable thinking mode.

Inference (Python)

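from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "tobil/qmd-query-expansion-0.6B-v4")

query = "how to configure authentication"

# CRITICAL: Use /no_think to disable Qwen3's thinking mode
messages = [{"role": "user", "content": f"/no_think Expand this search query: {query}"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate and decode (max_new_tokens=256 is an assumed budget, not a documented setting)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=256)[0]
output = tokenizer.decode(tokens, skip_special_tokens=True)

# Extract assistant response (skip_special_tokens converts to "user\n...\nassistant\n...")
if "\nassistant\n" in output:
    expansion = output.split("\nassistant\n")[-1].strip()
else:
    # Fall back to the full decoded text if the role marker is missing
    expansion = output.strip()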

Raw Format

<|im_start|>user
/no_think Expand this search query: auth<|im_end|>
<|im_start|>assistant
lex: authentication configuration
lex: auth settings
vec: how to configure authentication
vec: authentication setup guide
hyde: To configure authentication, set AUTH_SECRET in your environment.<|im_end|>

See PROMPT_FORMAT.md for the complete specification.
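
When a tokenizer is not available (for example, when calling a GGUF build through another runtime), the raw format above can be assembled by hand. This is a sketch that mirrors the transcript shown in this section; the helper names `build_raw_prompt` and `extract_expansion` are hypothetical, and PROMPT_FORMAT.md remains the authority on the exact format.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_raw_prompt(query: str) -> str:
    """Build the user turn plus the opening of the assistant turn."""
    return (
        f"{IM_START}user\n"
        f"/no_think Expand this search query: {query}{IM_END}\n"
        f"{IM_START}assistant\n"
    )


def extract_expansion(raw_transcript: str) -> str:
    """Return the assistant text from a transcript that still has special tokens."""
    # Keep everything after the last assistant marker, then cut at the end marker.
    answer = raw_transcript.rsplit(f"{IM_START}assistant\n", 1)[-1]
    return answer.split(IM_END, 1)[0].strip()
```

`build_raw_prompt("auth")` reproduces the first three lines of the raw transcript above, and `extract_expansion` recovers the `lex:`/`vec:`/`hyde:` lines from the full transcript.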

Directory Structure

finetune/
├── train.py              # SFT training (uses YAML config)
├── rl.py                 # GRPO/RL training (uses YAML config)
├── tui.py                # Interactive testing interface
├── configs/
│   ├── sft_v4.yaml       # SFT training config
│   └── grpo_v4.yaml      # GRPO training config
├── evals/
│   ├── run.py            # Generate model outputs to JSONL
│   ├── score.py          # Score outputs from JSONL
│   └── queries.txt       # Test queries
├── dataset/
│   ├── prepare_data.py   # Prepare training data
│   ├── clean_data.py     # Data quality improvements
│   └── generate_data*.py # Generate from source datasets
├── PROMPT_FORMAT.md      # Prompt format specification
├── SCORING.md            # Scoring criteria
└── data/
    └── train/            # Prepared training data

Quick Start

1. Prepare Training Data

cd dataset
uv run prepare_data.py --add-short 5

2. Train with YAML Config

# Local training (run from the finetune/ directory, not dataset/)
uv run train.py --config configs/sft_v4.yaml

# Or on HuggingFace Jobs
hf jobs uv run --flavor a10g-large --timeout 2h --secrets HF_TOKEN \
  "https://huggingface.co/datasets/tobil/qmd-query-expansion-train-v2/resolve/main/train_sft_v4.py"

3. Evaluate

# Generate outputs
uv run evals/run.py --model tobil/qmd-query-expansion-0.6B-v4

# Score them
uv run evals/score.py evals/results_tobil_qmd-query-expansion-0.6B-v4.jsonl

4. Interactive Testing

uv run tui.py

Training Configuration

Default SFT config (configs/sft_v4.yaml):

| Parameter | Value |
|-----------|-------|
| Method | LoRA (rank 16, alpha 32) |
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Batch Size | 4 (with 4x gradient accumulation) |
| Max Seq Length | 512 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
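
To make the numbers above concrete, here is a small sketch that restates them in Python and derives two quantities they imply. The dictionary keys are illustrative and are not the key names used in configs/sft_v4.yaml; reading "Batch Size 4" as a per-device batch size is also an assumption. `build_lora_config` uses PEFT's `LoraConfig` and is only one plausible way to express the LoRA row.

```python
# Values copied from the table above; key names are hypothetical.
SFT_HPARAMS = {
    "lora_r": 16,
    "lora_alpha": 32,
    "learning_rate": 2e-4,
    "num_epochs": 3,
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "max_seq_length": 512,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def effective_batch_size(hp: dict, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return hp["per_device_batch_size"] * hp["gradient_accumulation_steps"] * num_devices


def lora_scaling(hp: dict) -> float:
    """Standard LoRA scaling factor applied to the low-rank update (alpha / r)."""
    return hp["lora_alpha"] / hp["lora_r"]


def build_lora_config(hp: dict):
    """Build a PEFT LoraConfig from the values above (requires the peft package)."""
    from peft import LoraConfig  # imported lazily to keep the arithmetic dependency-free

    return LoraConfig(
        r=hp["lora_r"],
        lora_alpha=hp["lora_alpha"],
        target_modules=hp["target_modules"],
        task_type="CAUSAL_LM",
    )
```

On a single device this gives an effective batch size of 16 and a LoRA scaling factor of 2.0.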

Training Dataset

Key improvements in v2:

  • Short query examples with proper expansions
  • Hyde passages truncated to 150 chars
  • Key term preservation in lex lines
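
Two of these improvements are easy to illustrate. The sketch below shows one way hyde truncation and a key-term check for lex lines could work; it is not the code in dataset/clean_data.py, and the function names, the stopword list, and the substring-matching rule are all assumptions.

```python
import re

HYDE_MAX_CHARS = 150  # limit stated in the list above
# Hypothetical stopword list; the real cleaning script may use a different one.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "is", "how", "vs", "who", "what"}


def truncate_hyde(passage: str, limit: int = HYDE_MAX_CHARS) -> str:
    """Cut a hyde passage to at most `limit` characters, preferring a word boundary."""
    passage = passage.strip()
    if len(passage) <= limit:
        return passage
    cut = passage[:limit]
    # If the cut landed inside a word, drop the trailing partial word.
    if " " in cut and not passage[limit].isspace():
        cut = cut.rsplit(" ", 1)[0]
    return cut.rstrip()


def key_terms(query: str) -> set[str]:
    """Lowercased query tokens minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in STOPWORDS}


def lex_preserves_key_term(query: str, lex_line: str) -> bool:
    """True when at least one query key term appears (as a substring) in the lex text."""
    _, _, body = lex_line.partition(":")
    body = body.lower()
    return any(term in body for term in key_terms(query))
```

Substring matching is used so that a short query such as "auth" still counts as preserved in "lex: authentication configuration".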

Evaluation Results

SFT v4 (98.8% average score)

All 21 test queries rated "Excellent":

| Query | Score | Rating |
|-------|-------|--------|
| how to configure authentication | 99% | Excellent |
| auth | 95% | Excellent |
| git rebase vs merge | 100% | Excellent |
| react useEffect cleanup | 100% | Excellent |

GRPO v4 (89.7% - with SFT base)

All 26 test queries rated "Excellent" when loaded correctly (SFT first, then GRPO adapter).

| Query | Score | Rating |
|-------|-------|--------|
| AWS Lambda functions | 96% | Excellent |
| typescript async await | 92% | Excellent |
| kubernetes vs docker swarm | 92% | Excellent |
| who is TDS motorsports | 89% | Excellent |

Important: Loading the GRPO adapter directly onto the base model scores 0% (catastrophic drift), because the GRPO adapter was trained on top of merged SFT weights.

Known Issues

  • GRPO loading: The SFT adapter must be loaded and merged before the GRPO adapter is applied (see the note on GRPO models above)
  • Key term preservation: Some lex lines are still too generic and omit key terms from the query
  • Entity scoring: Named entity detection is heuristic-based and may miss some cases
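
To show why a heuristic detector can miss entities, here is a deliberately simple sketch based on capitalization. It is not the detection logic used by the reward function (see SCORING.md for the actual criteria); `heuristic_entities` and its rules are assumptions chosen only to illustrate the failure mode.

```python
import re


def heuristic_entities(query: str) -> list[str]:
    """Hypothetical heuristic: all-caps tokens, or capitalized tokens after the first word."""
    tokens = re.findall(r"[A-Za-z0-9][\w.+-]*", query)
    entities = []
    for i, tok in enumerate(tokens):
        is_acronym = len(tok) >= 2 and tok.isupper()
        # Skip the first token, which may be capitalized only because it starts the query.
        is_capitalized = i > 0 and tok[0].isupper()
        if is_acronym or is_capitalized:
            entities.append(tok)
    return entities
```

With these rules, "who is TDS motorsports" yields only `["TDS"]`, dropping the lowercase half of the name, and an all-lowercase query such as "kubernetes vs docker swarm" yields no entities at all.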