
Add HF Jobs scripts, temporal query examples, and training results

- jobs/sft.py and jobs/grpo.py: self-contained scripts for
  `hf jobs uv run` (no local GPU needed)
- 12 temporal/recency query examples in training data (e.g. "recent
  news about Shopify" -> lex with years 2025/2026)
- 4 temporal test queries in evals/queries.txt
- README updated with HF Jobs workflow, training results, and
  updated file structure
- Remove .beads tracking
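
As a rough illustration of the temporal examples described in the list above, the sketch below builds one training record in the `{"input", "output"}` JSONL shape used by `finetune/data/qmd_expansion.jsonl`, with `lex:` lines that carry explicit years. The helper name `temporal_example` and the wording of the `vec:` and `hyde:` lines are assumptions made for illustration; the 12 examples actually added in this commit are not reproduced here.

```python
import json


def temporal_example(query: str, topic: str, years: list[int]) -> dict:
    """Build one hypothetical training record for a recency-style query.

    The lex:/vec:/hyde: prefixes follow the output format named in the
    README; the specific expansions are invented for illustration.
    """
    # One lexical expansion per year, e.g. "lex: Shopify 2025".
    lines = [f"lex: {topic} {year}" for year in years]
    lines.append(f"vec: latest news and announcements about {topic}")
    lines.append(
        f"hyde: Recent coverage of {topic} from {years[0]} to {years[-1]} "
        "summarizes the newest announcements and developments."
    )
    return {"input": query, "output": "\n".join(lines)}


record = temporal_example("recent news about Shopify", "Shopify", [2025, 2026])
print(json.dumps(record))  # a single JSONL line
```

Pinning the years in the `lex:` lines is what lets a keyword backend favor recent documents; in a real dataset the years would need to track the current date.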

SFT and GRPO successfully trained on A10G via HF Jobs:
  SFT: eval loss 0.321, token accuracy 92.4%
  GRPO: mean reward 0.757, 200 steps, KL 0.00048
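
For context on what "self-contained" can mean for a script run with `hf jobs uv run`: `uv run` reads a script's dependencies from an inline metadata comment block (PEP 723), so one file can declare everything it needs. The sketch below shows only that shape. The empty dependency list, the `main` function, and the assumption that `--secrets HF_TOKEN` surfaces the token as an environment variable are placeholders, not the contents of `jobs/sft.py` or `jobs/grpo.py`.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = []
# ///

"""Hypothetical skeleton of a self-contained job script.

The comment block above is PEP 723 inline metadata. A real training job
would list its libraries under `dependencies`; the list is left empty
here so the sketch runs anywhere.
"""
import os


def main(env=None) -> str:
    """Report whether an HF_TOKEN secret is visible to the script."""
    env = os.environ if env is None else env
    # Assumption: the job runner exposes the secret as an env variable.
    return "token found" if env.get("HF_TOKEN") else "token missing"


if __name__ == "__main__":
    print(main())
```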

Co-Authored-By: Claude (claude-fudge-eap-cc) <noreply@anthropic.com>
Tobi Lutke, 3 months ago
parent
commit
5ab78d00a2

+ 0 - 48
.beads/issues.jsonl

@@ -1,48 +0,0 @@
-{"id":"qmd-0ic","title":"in qmd status, list all the additonal contexts under the collections that match","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T16:41:42.126194-05:00","updated_at":"2025-12-12T17:14:48.268119-05:00","closed_at":"2025-12-12T17:14:48.268119-05:00"}
-{"id":"qmd-18s","title":"Move cleanup/maintenance DB operations to store.ts","description":"Move cleanup operations from cleanup() command to store.ts. Create methods like deleteInactiveDocuments(), vacuumDatabase(), cleanupOrphanedContent(), etc.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T16:36:21.815781-05:00","updated_at":"2025-12-12T16:42:36.896806-05:00","closed_at":"2025-12-12T16:42:36.896806-05:00","dependencies":[{"issue_id":"qmd-18s","depends_on_id":"qmd-29c","type":"parent-child","created_at":"2025-12-12T16:37:03.014111-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-1xd","title":"Update tests for YAML-based collections","description":"Update all tests to use YAML config instead of DB collections. Update test helpers to create temporary YAML configs.","notes":"Test suite has been updated for YAML-based collections. 92 tests passing, 4 skipped, 10 failing.\n\nThe 4 skipped tests call getStatus() which has a bug (queries non-existent collections table).\n\nThe 10 failing tests are due to bugs in store.ts functions (findDocument, getDocumentBody, getDocument, findSimilarFiles, matchFilesByGlob) that need to be updated to use YAML configuration. These are production code bugs, not test bugs.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:54:53.349545-05:00","updated_at":"2025-12-13T11:37:16.935866-05:00","closed_at":"2025-12-13T11:37:16.935866-05:00","dependencies":[{"issue_id":"qmd-1xd","depends_on_id":"qmd-thw","type":"blocks","created_at":"2025-12-13T09:55:08.14305-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-29c","title":"Move all database operations from qmd.ts to store.ts","description":"Currently qmd.ts has ~70 direct database operations (db.prepare, db.exec). All database operations should be moved to store.ts to improve separation of concerns. qmd.ts should only use high-level methods from store.ts that don't require direct SQL knowledge.","notes":"Phase 1 complete: Moved collection operations (listCollections, removeCollection, renameCollection) to store.ts. Created 4 subtasks for remaining work: document indexing, context management, embeddings, and cleanup operations.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T16:32:13.722223-05:00","updated_at":"2025-12-12T16:49:53.829124-05:00","closed_at":"2025-12-12T16:49:53.829124-05:00"}
-{"id":"qmd-2gn","title":"Fix store.ts functions to use YAML collections","description":"Update findDocument(), getDocumentBody(), getDocument(), findSimilarFiles(), matchFilesByGlob(), and getStatus() to use YAML collection configuration instead of querying the collections table. These functions currently fail because they try to query the non-existent collections table.","notes":"Fixed:\n- FTS schema (filepath, title, body columns) \n- getStatus() to use YAML collections\n- searchFTS() to not query collections table\n- findDocument() absolute path matching\n\nTest results: 93 passing (up from 92), 4 skipped, 9 failing\n\nRemaining failures:\n- getDocumentBody (2 tests)\n- getDocument (1 test)  \n- findSimilarFiles (2 tests)\n- matchFilesByGlob (1 test)\n- Integration/context tests (3 tests)","status":"in_progress","priority":1,"issue_type":"bug","created_at":"2025-12-13T11:37:22.706882-05:00","updated_at":"2025-12-13T12:32:38.336752-05:00"}
-{"id":"qmd-3qi","title":"Document docid hash usage in CLAUDE.md","description":"Update CLAUDE.md to document that short hash IDs (#abc123) work with get and multi-get commands. Include examples.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-15T12:54:30.634116-05:00","updated_at":"2025-12-15T13:12:38.91973-05:00","closed_at":"2025-12-15T13:12:38.91973-05:00","dependencies":[{"issue_id":"qmd-3qi","depends_on_id":"qmd-lwo","type":"parent-child","created_at":"2025-12-15T12:54:52.002856-05:00","created_by":"daemon","metadata":"{}"},{"issue_id":"qmd-3qi","depends_on_id":"qmd-apl","type":"blocks","created_at":"2025-12-15T12:54:52.066036-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-3z9","title":"Design YAML schema and create collections.ts module","description":"Create collections.ts to manage YAML-based collection configuration at ~/.config/qmd/index.yml. Define TypeScript types for collections and contexts. Implement load/save functions with Bun's native YAML support.","design":"YAML structure:\n```yaml\n# Global context for all collections\nglobal_context: \"...\"\n\ncollections:\n  name:\n    path: /absolute/path\n    pattern: \"**/*.md\"\n    context:\n      \"/path/prefix\": \"Description\"\n      \"/\": \"Root context\"\n```\n\nTypeScript types:\n- Collection: { path, pattern, context }\n- CollectionConfig: { global_context?, collections }\n- Functions: loadConfig(), saveConfig(), getCollection(), listCollections()","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:54:52.586027-05:00","updated_at":"2025-12-13T09:56:57.309927-05:00","closed_at":"2025-12-13T09:56:57.309927-05:00"}
-{"id":"qmd-4ru","title":"Update document retrieval for new schema","description":"Functions like getDocument, findDocument, getMultipleDocuments need to work with new schema (path instead of filepath, content joins, virtual paths).","status":"closed","priority":0,"issue_type":"task","created_at":"2025-12-12T15:29:53.911881-05:00","updated_at":"2025-12-12T15:56:11.054888-05:00","closed_at":"2025-12-12T15:56:11.054888-05:00","dependencies":[{"issue_id":"qmd-4ru","depends_on_id":"qmd-ama","type":"discovered-from","created_at":"2025-12-12T15:29:53.912607-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-4u4","title":"Move embedding/vector DB operations to store.ts","description":"Move vector indexing DB operations from vectorIndex() to store.ts. Create methods like getHashesForEmbedding(), insertEmbedding(), clearEmbeddings(), etc.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T16:36:21.683434-05:00","updated_at":"2025-12-12T16:42:40.42653-05:00","closed_at":"2025-12-12T16:42:40.42653-05:00","dependencies":[{"issue_id":"qmd-4u4","depends_on_id":"qmd-29c","type":"parent-child","created_at":"2025-12-12T16:37:02.944591-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-6s5","title":"Export current database to index.yml","description":"Write a script to export current collections and path_contexts from SQLite to ~/.config/qmd/index.yml format. Include all collection metadata and contexts.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:54:52.707844-05:00","updated_at":"2025-12-13T09:57:36.650437-05:00","closed_at":"2025-12-13T09:57:36.650437-05:00","dependencies":[{"issue_id":"qmd-6s5","depends_on_id":"qmd-3z9","type":"blocks","created_at":"2025-12-13T09:55:07.606834-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-7ss","title":"remove all the symlinks and stuff in the git repo, clean up the root directory","status":"closed","priority":4,"issue_type":"task","created_at":"2025-12-12T16:40:00.744982-05:00","updated_at":"2025-12-12T17:11:18.034215-05:00","closed_at":"2025-12-12T17:11:18.034215-05:00"}
-{"id":"qmd-8eu","title":"Update documents table schema for collection names","description":"Change documents.collection_id (integer FK) to documents.collection (text). Update all queries and indices. Keep backwards compatibility during transition.","design":"Schema change:\n- Add `collection TEXT` column\n- Migrate data: UPDATE documents SET collection = (SELECT name FROM collections WHERE id = collection_id)\n- Drop collection_id column\n- Update FTS5 trigger\n- Update all queries in store.ts","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:54:52.830305-05:00","updated_at":"2025-12-13T10:08:24.88716-05:00","closed_at":"2025-12-13T10:08:24.88716-05:00","dependencies":[{"issue_id":"qmd-8eu","depends_on_id":"qmd-6s5","type":"blocks","created_at":"2025-12-13T09:55:07.662048-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-9ij","title":"Conditional query expansion based on BM25 signal strength","description":"Query expansion helps recall but injects false positives. Skip expansion if original BM25 top-5 has strong signals (exact term hits, high proximity). Only expand when recall is weak.","status":"closed","priority":3,"issue_type":"feature","created_at":"2025-12-20T17:18:41.806447-05:00","updated_at":"2025-12-21T12:05:40.85997-05:00","closed_at":"2025-12-21T12:05:40.85997-05:00"}
-{"id":"qmd-9ua","title":"Update all qmd commands for YAML-based collections","description":"Update qmd.ts commands: collection add/list/remove/rename, status, update, ls. All should use collections.ts instead of store.ts collection functions.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:54:53.14644-05:00","updated_at":"2025-12-13T10:17:39.67707-05:00","closed_at":"2025-12-13T10:17:39.67707-05:00","dependencies":[{"issue_id":"qmd-9ua","depends_on_id":"qmd-u84","type":"blocks","created_at":"2025-12-13T09:55:07.893268-05:00","created_by":"daemon","metadata":"{}"},{"issue_id":"qmd-9ua","depends_on_id":"qmd-oxy","type":"blocks","created_at":"2025-12-13T09:55:07.942221-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-afe","title":"implement qmd collection rename, which changes the global path prefix for the collection","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T15:55:54.779325-05:00","updated_at":"2025-12-12T16:29:24.153196-05:00","closed_at":"2025-12-12T16:29:24.153196-05:00"}
-{"id":"qmd-ama","title":"Refactor database system","description":"All documents should be stored as content addressable hash, e.g. hash, doc, created_at,\n┃ updated_at. documents should be a file system layer on top e.g. collection, path, hash,\n┃ created_at, updated_at. (collection,path)\n┃\n┃\n\n┃ All documents should be stored as content addressable hash, e.g. hash, doc, created_at,\n┃ updated_at. documents should be a file system layer on top e.g. collection_id, path, hash,\n┃ created_at, updated_at. (collection,path) is unique. There is also collection which stores PWD\n┃ + glob pattern, name (\\w+). Every document is treated as path qmd://collection.name/","notes":"## Completed\n- ✅ Implemented content-addressable storage (content table with hash→doc mapping)\n- ✅ Refactored documents table as file system layer (collection_id, path, hash)\n- ✅ Added collection names (e.g., \"pages\", \"journals\", \"archive\")\n- ✅ Implemented virtual paths (qmd://collection-name/path/to/file.md)\n- ✅ Added hierarchical context support (collection-scoped)\n- ✅ Successfully migrated existing database\n- ✅ Updated search functions to work with new schema\n- ✅ Updated indexing logic to use content-addressable storage\n- ✅ Orphaned content hash cleanup\n\n## Still TODO\n- Fix migration SQL to properly extract basename (currently needs manual fix)\n- Implement `qmd collection add . --name \u003cname\u003e --mask '**/*.md'`\n- Implement `qmd ls [path]` for exploring virtual file tree","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-10T10:57:35.497489-05:00","updated_at":"2025-12-12T15:39:48.879143-05:00","closed_at":"2025-12-12T15:39:48.879143-05:00"}
-{"id":"qmd-apl","title":"Support docid hash lookup in get and multi-get commands","description":"Allow get and multi-get to accept #hash (6 char) as a file identifier. Add lookup function to resolve short hash to full document path. Handle collisions gracefully.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-15T12:54:30.448466-05:00","updated_at":"2025-12-15T13:12:38.741755-05:00","closed_at":"2025-12-15T13:12:38.741755-05:00","dependencies":[{"issue_id":"qmd-apl","depends_on_id":"qmd-lwo","type":"parent-child","created_at":"2025-12-15T12:54:51.903613-05:00","created_by":"daemon","metadata":"{}"},{"issue_id":"qmd-apl","depends_on_id":"qmd-gbt","type":"blocks","created_at":"2025-12-15T12:54:52.031069-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-bs8","title":"Update documentation for YAML configuration","description":"Update CLAUDE.md, README.md with new YAML configuration approach. Document index.yml format and manual editing instructions.","status":"open","priority":2,"issue_type":"task","created_at":"2025-12-13T09:54:53.449584-05:00","updated_at":"2025-12-13T09:54:53.449584-05:00","dependencies":[{"issue_id":"qmd-bs8","depends_on_id":"qmd-1xd","type":"blocks","created_at":"2025-12-13T09:55:08.264615-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-bx1","title":"Fix migration SQL for proper basename extraction","description":"The migration currently generates collection names incorrectly (uses full path instead of basename). Need to fix the SQL in migrateToContentAddressable to properly extract the directory basename.","status":"closed","priority":1,"issue_type":"bug","created_at":"2025-12-12T15:29:53.757723-05:00","updated_at":"2025-12-12T15:50:29.349134-05:00","closed_at":"2025-12-12T15:50:29.349134-05:00","dependencies":[{"issue_id":"qmd-bx1","depends_on_id":"qmd-ama","type":"discovered-from","created_at":"2025-12-12T15:29:53.758524-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-c0m","title":"Comprehensive CLI review and consistency pass","description":"Review entire CLI command structure:\n- Consistent naming (add vs create, remove vs delete)\n- Consistent flag usage (--name, --mask, etc)\n- Update help text for all commands\n- Ensure virtual paths work everywhere\n- Test all commands end-to-end","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-12T15:29:38.083564-05:00","updated_at":"2025-12-12T16:06:51.544695-05:00","closed_at":"2025-12-12T16:06:51.544695-05:00"}
-{"id":"qmd-clr","title":"fix embed","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T16:14:55.292114-05:00","updated_at":"2025-12-12T16:31:27.661829-05:00","closed_at":"2025-12-12T16:31:27.661829-05:00"}
-{"id":"qmd-d00","title":"Add offline evaluation harness for tuning","description":"Create a small benchmark with ~100 labeled queries from real searches. Would enable tuning: expansion on/off threshold, candidate count (30 vs 100), blending weights, reranker threshold.","notes":"Test samples must be: 1) entirely synthetic, OR 2) public documents (e.g., public podcasts, public memos). No private/personal content in eval set.","status":"closed","priority":3,"issue_type":"feature","created_at":"2025-12-20T17:18:42.007265-05:00","updated_at":"2025-12-21T12:10:40.497797-05:00","closed_at":"2025-12-21T12:10:40.497797-05:00"}
-{"id":"qmd-deh","title":"Refactor database introduce qmd collection *","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-10T10:56:04.516137-05:00","updated_at":"2025-12-12T16:12:12.349428-05:00","closed_at":"2025-12-12T16:12:12.349428-05:00"}
-{"id":"qmd-df5","title":"Rerank multiple chunks per document with score aggregation","description":"Currently we only rerank 1 chunk per doc (selected by keyword heuristic). Should rerank top 2-3 chunks per document, then aggregate scores (max, softmax, or top-2 average). This improves ranking for long documents where the keyword-matched chunk isn't always the most relevant.","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-12-20T17:18:41.592575-05:00","updated_at":"2025-12-21T12:04:11.777309-05:00","closed_at":"2025-12-21T12:04:11.777309-05:00"}
-{"id":"qmd-dmi","title":"Implement 'qmd collection' commands","description":"Add explicit collection management:\n- qmd collection add . --name \u003cname\u003e --mask '**/*.md'\n- qmd collection list\n- qmd collection remove \u003cname\u003e\n\nThis gives users control over collection names and patterns.","status":"closed","priority":1,"issue_type":"feature","created_at":"2025-12-12T15:29:53.810666-05:00","updated_at":"2025-12-12T16:02:08.079158-05:00","closed_at":"2025-12-12T16:02:08.079158-05:00","dependencies":[{"issue_id":"qmd-dmi","depends_on_id":"qmd-ama","type":"discovered-from","created_at":"2025-12-12T15:29:53.811294-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-dt1","title":"Redesign context add command for better usability","description":"Current issues: \n1. Virtual path qmd://journals/ is rejected as invalid\n2. Syntax is confusing - sometimes path is first arg, sometimes second\n3. Need to support collection root context (qmd://name/)\n4. Should be intuitive: qmd context add \u003cwhere\u003e \u003cwhat\u003e\nDesign goals:\n- Support qmd://collection/ for collection root context\n- Support qmd://collection/path for path-specific context\n- Clear, consistent syntax\n- Good error messages","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:39:19.764114-05:00","updated_at":"2025-12-13T09:41:38.467861-05:00","closed_at":"2025-12-13T09:41:38.467861-05:00"}
-{"id":"qmd-e2c","title":"Implement 'qmd ls' command","description":"Add command to explore virtual file tree:\n- qmd ls → list all collections\n- qmd ls \u003ccollection\u003e → list files in collection\n- qmd ls \u003ccollection\u003e/\u003cpath\u003e → list files under path\nOutput: flat list of qmd:// paths","status":"closed","priority":1,"issue_type":"feature","created_at":"2025-12-12T15:29:53.859804-05:00","updated_at":"2025-12-12T15:55:12.777701-05:00","closed_at":"2025-12-12T15:55:12.777701-05:00","dependencies":[{"issue_id":"qmd-e2c","depends_on_id":"qmd-ama","type":"discovered-from","created_at":"2025-12-12T15:29:53.860535-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-gbt","title":"Add docid field (first 6 chars of hash) to search results","description":"Include docid formatted as #hash[0:5] in all search output formats (CLI, JSON, CSV, Markdown, XML, files). The docid should be the first 6 characters of the document's SHA256 hash.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-15T12:54:30.388094-05:00","updated_at":"2025-12-15T13:12:27.553492-05:00","closed_at":"2025-12-15T13:12:27.553492-05:00","dependencies":[{"issue_id":"qmd-gbt","depends_on_id":"qmd-lwo","type":"parent-child","created_at":"2025-12-15T12:54:51.846848-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-ht6","title":"Add --line-numbers CLI option for line-numbered output","description":"Add --line-numbers flag to CLI that formats output with line numbers: each line becomes \"{lineNum}: {content}\". Apply to get, multi-get, and search snippet output.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-15T12:54:30.51773-05:00","updated_at":"2025-12-15T13:12:38.810032-05:00","closed_at":"2025-12-15T13:12:38.810032-05:00","dependencies":[{"issue_id":"qmd-ht6","depends_on_id":"qmd-lwo","type":"parent-child","created_at":"2025-12-15T12:54:51.941635-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-hw2","title":"Fix MCP resource listing and display paths","description":"Remove MCP resource listing, fix display paths to include collection name, handelize filenames during indexing, make line-numbers default for MCP snippets.","status":"closed","priority":1,"issue_type":"epic","created_at":"2025-12-16T12:00:02.815946-05:00","updated_at":"2025-12-16T12:26:17.690252-05:00","closed_at":"2025-12-16T12:26:17.690252-05:00"}
-{"id":"qmd-i3t","title":"Move context management DB operations to store.ts","description":"Move path_contexts INSERT/DELETE/SELECT operations from addContext(), listContexts(), removeContext() to store.ts. Create methods like insertContext(), deleteContext(), etc.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T16:36:21.561746-05:00","updated_at":"2025-12-12T16:48:57.271485-05:00","closed_at":"2025-12-12T16:48:57.271485-05:00","dependencies":[{"issue_id":"qmd-i3t","depends_on_id":"qmd-29c","type":"parent-child","created_at":"2025-12-12T16:37:02.866006-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-j9z","title":"Add unit tests for content addressable hashes","description":"add same file from multiple places and verify that they both point at same hash. drop one collection and the content stays.","status":"closed","priority":3,"issue_type":"task","created_at":"2025-12-12T15:39:15.459504-05:00","updated_at":"2025-12-12T16:21:35.473776-05:00","closed_at":"2025-12-12T16:21:35.473776-05:00"}
-{"id":"qmd-kf8","title":"Move document indexing DB operations to store.ts","description":"Move INSERT/UPDATE/DELETE operations for documents and content tables from indexFiles() to store.ts. Create methods like insertDocument(), updateDocument(), deactivateDocuments(), etc.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T16:36:14.558702-05:00","updated_at":"2025-12-12T16:45:38.830978-05:00","closed_at":"2025-12-12T16:45:38.830978-05:00","dependencies":[{"issue_id":"qmd-kf8","depends_on_id":"qmd-29c","type":"parent-child","created_at":"2025-12-12T16:37:02.770251-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-ltg","title":"look for missing context","description":"i ran qmd context list and thats only one bit of context, i had a lot more. i think the path matching isn't quite working right","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T16:42:57.324769-05:00","updated_at":"2025-12-12T17:16:27.835047-05:00","closed_at":"2025-12-12T17:16:27.835047-05:00"}
-{"id":"qmd-lwo","title":"Add short docid hash references and line numbers to output","description":"Add short 6-character hash IDs (docid #abc123) to search results and document retrieval, support these IDs in get/multi-get commands, and add --line-numbers option for line-numbered output.","status":"closed","priority":1,"issue_type":"epic","created_at":"2025-12-15T12:54:30.335556-05:00","updated_at":"2025-12-15T13:12:43.758557-05:00","closed_at":"2025-12-15T13:12:43.758557-05:00"}
-{"id":"qmd-mro","title":"Add lineNumbers boolean to MCP tools","description":"Add lineNumbers boolean parameter to MCP get, multi_get, and search tools. When true, return content with line numbers prefixed to each line.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-15T12:54:30.581671-05:00","updated_at":"2025-12-15T13:12:38.8682-05:00","closed_at":"2025-12-15T13:12:38.8682-05:00","dependencies":[{"issue_id":"qmd-mro","depends_on_id":"qmd-lwo","type":"parent-child","created_at":"2025-12-15T12:54:51.970363-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-oxy","title":"Update context system to use YAML","description":"Remove path_contexts table. Implement context management in collections.ts. Update context add/list/rm commands to modify YAML file instead of database.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:54:53.042839-05:00","updated_at":"2025-12-13T10:16:07.680285-05:00","closed_at":"2025-12-13T10:16:07.680285-05:00","dependencies":[{"issue_id":"qmd-oxy","depends_on_id":"qmd-3z9","type":"blocks","created_at":"2025-12-13T09:55:07.842488-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-p1h","title":"Create collection add|remove","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-10T10:57:00.717864-05:00","updated_at":"2025-12-12T16:12:00.557003-05:00","closed_at":"2025-12-12T16:12:00.557003-05:00"}
-{"id":"qmd-rck","title":"move the source files to src/*, clean up teh directory","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T16:40:19.198119-05:00","updated_at":"2025-12-12T17:12:22.502746-05:00","closed_at":"2025-12-12T17:12:22.502746-05:00"}
-{"id":"qmd-rhd","title":"Fix 'qmd status' output for new schema","description":"Update status to show collections by name, cleaner context display, virtual path examples.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T15:29:54.020596-05:00","updated_at":"2025-12-12T16:13:28.08389-05:00","closed_at":"2025-12-12T16:13:28.08389-05:00","dependencies":[{"issue_id":"qmd-rhd","depends_on_id":"qmd-ama","type":"discovered-from","created_at":"2025-12-12T15:29:54.021095-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-s1y","title":"Update 'qmd add-context' for collection scoping","description":"Update add-context to work with collection-scoped contexts using new path_contexts schema.","notes":"Refactoring to:\n- qmd context add [path] \"text\" (defaults to current collection if in one)\n- qmd context list\n- qmd context rm \u003cpath\u003e\n- Support \"/\" for global/system context\n- Auto-detect collection from pwd","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T15:29:54.076582-05:00","updated_at":"2025-12-12T15:37:47.683263-05:00","closed_at":"2025-12-12T15:37:47.683263-05:00"}
-{"id":"qmd-thw","title":"Drop collections and path_contexts tables","description":"Remove collections and path_contexts tables from schema. Update initDb() to not create these tables. Only keep documents, content, and search indices.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:54:53.247136-05:00","updated_at":"2025-12-13T10:27:06.54374-05:00","closed_at":"2025-12-13T10:27:06.54374-05:00","dependencies":[{"issue_id":"qmd-thw","depends_on_id":"qmd-9ua","type":"blocks","created_at":"2025-12-13T09:55:08.027101-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-u84","title":"Refactor store.ts to use collections.ts","description":"Replace all collection DB queries with collections.ts calls. Remove getCollectionById, getCollectionByName, listCollections DB functions. Use YAML config instead.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:54:52.936782-05:00","updated_at":"2025-12-13T10:16:07.681047-05:00","closed_at":"2025-12-13T10:16:07.681047-05:00","dependencies":[{"issue_id":"qmd-u84","depends_on_id":"qmd-3z9","type":"blocks","created_at":"2025-12-13T09:55:07.720439-05:00","created_by":"daemon","metadata":"{}"},{"issue_id":"qmd-u84","depends_on_id":"qmd-8eu","type":"blocks","created_at":"2025-12-13T09:55:07.782051-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-vro","title":"Update 'qmd get' to support virtual paths","description":"Allow qmd get to accept both virtual paths (qmd://journals/...) and filesystem paths, plus fuzzy matching by filename.","status":"closed","priority":0,"issue_type":"task","created_at":"2025-12-12T15:29:53.963113-05:00","updated_at":"2025-12-12T15:47:29.178955-05:00","closed_at":"2025-12-12T15:47:29.178955-05:00","dependencies":[{"issue_id":"qmd-vro","depends_on_id":"qmd-ama","type":"discovered-from","created_at":"2025-12-12T15:29:53.963641-05:00","created_by":"daemon","metadata":"{}"}]}
-{"id":"qmd-x19","title":"Update 'qmd add-context' for collection-scoped contexts","description":"Update add-context to work with collections:\n- qmd add-context \u003ccollection\u003e/\u003cpath\u003e \"context description\"\n- Support both virtual and filesystem paths\n- Update to use new path_contexts schema","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T15:29:38.142575-05:00","updated_at":"2025-12-12T15:53:00.525001-05:00","closed_at":"2025-12-12T15:53:00.525001-05:00"}
-{"id":"qmd-x64","title":"for each collection, on update, check if there is a .git directory, if so write out the git status, add --pull as a qmd update --pull parameter which also executes git pull before reindexing\n","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-12T17:04:15.994054-05:00","updated_at":"2025-12-12T17:14:40.107181-05:00","closed_at":"2025-12-12T17:14:40.107181-05:00"}
-{"id":"qmd-yzj","title":"Add optional update: command support to collections YAML","description":"Collections can now specify an optional 'update:' key with a bash command that will be executed during 'qmd update' before indexing files. Runs in cwd, prints output, stops on error.","status":"closed","priority":2,"issue_type":"feature","created_at":"2025-12-13T11:16:32.527608-05:00","updated_at":"2025-12-13T11:17:57.500434-05:00","closed_at":"2025-12-13T11:17:57.500434-05:00"}
-{"id":"qmd-zin","title":"Improve qmd ls command to be more like ls -l with colors","description":"Make qmd ls more Unix-like:\n1. Format like ls -l with columns (permissions, size, date, name)\n2. Add colors (directories, files, etc.)\n3. Dim the qmd:// prefix to show it's optional\n4. Show file sizes in human-readable format\n5. Show modification times\n6. Consider adding -l flag for long format","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-13T09:44:48.703843-05:00","updated_at":"2025-12-13T09:48:22.298822-05:00","closed_at":"2025-12-13T09:48:22.298822-05:00"}

+ 51 - 9
finetune/README.md

@@ -21,16 +21,16 @@ These feed into QMD's three search backends:
 
 ## Quick Start
 
-### End-to-end pipeline for Qwen3-1.7B
+### Cloud training via HuggingFace Jobs (no GPU needed)
 
 ```bash
-# 1. SFT: teach the model the output format from labeled examples
-uv run train.py sft --config configs/sft.yaml
+# 1. SFT: teach the model the output format (~45 min on A10G, ~$1.50)
+hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py
 
-# 2. GRPO: improve quality via RL using the reward function
-uv run train.py grpo --config configs/grpo.yaml
+# 2. GRPO: RL refinement on top of SFT (~20 min on A10G, ~$0.50)
+hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 4h jobs/grpo.py
 
-# 3. Evaluate against test queries
+# 3. Evaluate against test queries (needs local GPU or use eval job)
 uv run eval.py --model tobil/qmd-query-expansion-1.7B-grpo \
                --sft-model tobil/qmd-query-expansion-1.7B-sft
 
@@ -38,6 +38,22 @@ uv run eval.py --model tobil/qmd-query-expansion-1.7B-grpo \
 uv run convert_gguf.py --size 1.7B
 ```
 
+### Local training (if you have a GPU)
+
+```bash
+uv run train.py sft  --config configs/sft.yaml
+uv run train.py grpo --config configs/grpo.yaml
+```
+
+### Monitoring HF Jobs
+
+```bash
+hf jobs ps                           # list running jobs
+hf jobs inspect <job-id>             # check status
+hf jobs logs <job-id>                # stream logs
+hf jobs cancel <job-id>              # cancel a job
+```
+
 ## Prompt Format
 
 All tools use the same prompt — **Qwen3 chat template with `/no_think`**:
@@ -59,13 +75,16 @@ finetune/
 ├── train.py           # Unified SFT + GRPO training (two subcommands)
 ├── eval.py            # Generate expansions and score them
 ├── convert_gguf.py    # GGUF conversion for Ollama/llama.cpp
+├── jobs/
+│   ├── sft.py         # Self-contained SFT for HuggingFace Jobs
+│   └── grpo.py        # Self-contained GRPO for HuggingFace Jobs
 ├── configs/
 │   ├── sft.yaml       # SFT hyperparameters for Qwen3-1.7B
 │   └── grpo.yaml      # GRPO hyperparameters for Qwen3-1.7B
 ├── evals/
-│   └── queries.txt    # 27 test queries across 7 categories
+│   └── queries.txt    # 31 test queries across 8 categories
 ├── data/
-│   └── qmd_expansion.jsonl  # Source training data (5,730 examples)
+│   └── qmd_expansion.jsonl  # Source training data (5,742 examples)
 ├── dataset/
 │   ├── generate_data.py         # Generate data via Claude API
 │   ├── generate_data_offline.py # Generate from existing HF dataset
@@ -86,7 +105,7 @@ Teaches the model the `lex:/vec:/hyde:` output format from labeled examples.
 | Base model | `Qwen/Qwen3-1.7B` |
 | Method | LoRA (rank 16, alpha 32) |
 | Target modules | All projection layers (q/k/v/o/gate/up/down) |
-| Dataset | 6,180 examples (26.5% short queries) |
+| Dataset | 11,124 examples (train split) |
 | Effective batch size | 16 (4 × 4 gradient accumulation) |
 | Epochs | 3 |
 | Learning rate | 2e-4 (cosine schedule) |
@@ -231,3 +250,26 @@ The two-stage training approach (SFT → GRPO) is standard for structured-output
 
 The reward function is entirely rule-based (no LLM judge) which makes it fast,
 deterministic, and suitable as an RL signal. See `SCORING.md` for the full rubric.
+
+## Training Results (Qwen3-1.7B)
+
+### SFT
+
+| Metric | Value |
+|--------|-------|
+| Final train loss | 0.223 |
+| Final eval loss | 0.321 |
+| Token accuracy (train) | 94.8% |
+| Token accuracy (eval) | 92.4% |
+| Hardware | A10G (24 GB VRAM) |
+
+### GRPO
+
+| Metric | Value |
+|--------|-------|
+| Mean reward | 0.757 |
+| Final loss | 0.0005 |
+| KL divergence | 0.00048 |
+| Mean completion length | ~58 tokens |
+| Training time | ~19 min (200 steps) |
+| Hardware | A10G (24 GB VRAM) |
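
Review note: the SFT table's effective batch size and the rough number of optimizer steps can be worked out as in the sketch below. It is back-of-the-envelope arithmetic. It assumes a single GPU and that the 11,124 figure is the number of examples actually seen per epoch (jobs/sft.py holds out 10% for evaluation, so the real count may be smaller). `steps_per_epoch` is a hypothetical helper, and the trainer's reported count can differ by a step depending on how it handles the final partial batch.

```python
import math


def steps_per_epoch(n_examples: int, per_device_batch: int,
                    grad_accum: int, n_devices: int = 1) -> int:
    """Approximate optimizer steps per epoch (hypothetical helper)."""
    effective = per_device_batch * grad_accum * n_devices
    return math.ceil(n_examples / effective)


# Values taken from the README table and jobs/sft.py.
effective_batch = 4 * 4                      # per-device batch x gradient accumulation
per_epoch = steps_per_epoch(11_124, 4, 4)    # assumes 11,124 training examples
total_steps = per_epoch * 3                  # num_train_epochs=3

print(effective_batch, per_epoch, total_steps)
```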

+ 12 - 0
finetune/data/qmd_expansion.jsonl

@@ -5728,3 +5728,15 @@
 {"input": "baby food", "output": "lex: infant meal\nlex: child nutrition\nvec: infant meal\nvec: child nutrition\nhyde: The topic of baby food covers child nutrition. Proper implementation follows established patterns and best practices."}
 {"input": "cloud storage options", "output": "lex: overview of popular\nlex: importance of backing\nvec: overview of popular cloud storage services available\nvec: importance of backing up data securely\nhyde: The cloud storage options configuration can be customized by overview of popular cloud storage services available. Default values work for most use cases."}
 {"input": "what is the role of irony in literature?", "output": "lex: definition of irony\nlex: importance of irony\nvec: definition of irony and its significance\nvec: importance of irony in creating depth in text\nhyde: The role of irony in literature? is defined as debates surrounding the interpretations of irony. This plays a crucial role in modern development practices."}
+{"input": "recent news about Shopify", "output": "lex: shopify corporate news 2025 2026\nlex: shopify product updates recent\nlex: shopify earnings announcements latest\nvec: what are the latest news and developments about Shopify\nvec: recent Shopify product launches and company announcements\nhyde: Shopify recently announced new features for its commerce platform, including AI-powered tools for merchants and expanded checkout capabilities in Q4 2025."}
+{"input": "latest AI developments", "output": "lex: artificial intelligence breakthroughs 2025 2026\nlex: AI news updates recent\nlex: machine learning advances latest\nvec: what are the most recent developments in artificial intelligence\nvec: latest breakthroughs and advances in AI and machine learning\nhyde: Recent AI developments include advances in reasoning models, multimodal systems, and on-device inference capabilities released throughout 2025."}
+{"input": "new features in React", "output": "lex: React new features 2025 2026\nlex: React latest release changelog\nlex: React updates improvements\nvec: what new features have been added to React recently\nvec: latest React release notes and new capabilities\nhyde: React 19 introduced server components, actions, and improved concurrent rendering. The latest updates in 2025 added compiler optimizations and new hooks."}
+{"input": "current stock market trends", "output": "lex: stock market trends 2025 2026\nlex: market performance recent\nlex: equity market outlook current\nvec: what are the current trends in the stock market\nvec: recent stock market performance and investment trends\nhyde: The stock market in 2025 has been characterized by continued growth in technology stocks and increased volatility driven by interest rate decisions and geopolitical factors."}
+{"input": "today weather forecast", "output": "lex: weather forecast today current\nlex: weather conditions now\nvec: what is the current weather forecast for today\nvec: today's weather conditions and temperature\nhyde: The current weather forecast shows conditions for the day with expected temperatures and precipitation levels based on the latest meteorological data."}
+{"input": "upcoming conferences tech", "output": "lex: technology conferences 2025 2026\nlex: tech events upcoming schedule\nlex: developer conferences next\nvec: what technology conferences are coming up next\nvec: upcoming developer and tech industry events and conferences\nhyde: Major upcoming technology conferences include developer summits, AI conferences, and industry events scheduled for late 2025 and early 2026."}
+{"input": "recent changes to immigration policy", "output": "lex: immigration policy changes 2025 2026\nlex: immigration reform updates recent\nlex: visa policy latest news\nvec: what are the recent changes to immigration policy\nvec: latest immigration reform updates and policy changes\nhyde: Recent immigration policy changes include updates to visa processing, work permit regulations, and border control measures enacted in 2025."}
+{"input": "this week startup funding", "output": "lex: startup funding rounds 2025 recent\nlex: venture capital deals this week\nlex: startup investment news latest\nvec: what startups received funding this week\nvec: recent venture capital and startup funding announcements\nhyde: This week saw several notable startup funding rounds, with companies raising Series A and B rounds across AI, fintech, and climate technology sectors."}
+{"input": "Shopify Q4 earnings", "output": "lex: Shopify Q4 earnings report 2025\nlex: Shopify quarterly results revenue\nlex: Shopify financial performance\nvec: what were Shopify's Q4 earnings results\nvec: Shopify quarterly earnings report and financial performance\nhyde: Shopify reported Q4 2025 earnings showing revenue growth driven by merchant solutions and subscription revenue increases across its commerce platform."}
+{"input": "best laptops right now", "output": "lex: best laptops 2025 2026\nlex: top rated laptops current\nlex: laptop recommendations latest\nvec: what are the best laptops available right now\nvec: current top-rated laptop recommendations and reviews\nhyde: The best laptops currently available in 2025 include models with the latest processors, improved battery life, and AI-enhanced features from manufacturers like Apple, Lenovo, and Dell."}
+{"input": "recently deprecated Python features", "output": "lex: Python deprecated features 2025\nlex: Python breaking changes recent\nlex: Python removal schedule latest\nvec: what Python features have been recently deprecated\nvec: recent Python deprecations and breaking changes in latest versions\nhyde: Python 3.13 and 3.14 deprecated several features including legacy string formatting methods and certain stdlib modules, with removal scheduled for upcoming releases."}
+{"input": "what changed in kubernetes latest version", "output": "lex: Kubernetes changelog latest version 2025\nlex: Kubernetes new features recent release\nlex: k8s updates changes\nvec: what are the changes in the latest Kubernetes version\nvec: recent Kubernetes release notes and new features\nhyde: The latest Kubernetes release introduced enhancements to pod scheduling, improved sidecar container support, and new gateway API features for service mesh integration."}
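
Review note: each added record is one JSON object per line, with the expansion stored as newline-separated `lex:`, `vec:` and `hyde:` lines in `output`. A minimal sketch for sanity-checking a record is shown below. It uses the "today weather forecast" record from this hunk; `count_line_types` is a hypothetical helper, not something in the repository.

```python
import json
from collections import Counter

VALID_PREFIXES = ("lex", "vec", "hyde")

# One record copied from the diff; raw strings keep the JSON "\n" escapes intact.
RECORD_LINE = (
    r'{"input": "today weather forecast", "output": "lex: weather forecast today current\n'
    r'lex: weather conditions now\nvec: what is the current weather forecast for today\n'
    r"vec: today's weather conditions and temperature\n"
    r'hyde: The current weather forecast shows conditions for the day with expected temperatures '
    r'and precipitation levels based on the latest meteorological data."}'
)


def count_line_types(output: str) -> Counter:
    """Count lex/vec/hyde lines; anything else is tallied as 'invalid'."""
    counts = Counter()
    for line in output.splitlines():
        if not line.strip():
            continue
        prefix, sep, _ = line.partition(":")
        kind = prefix.strip() if sep and prefix.strip() in VALID_PREFIXES else "invalid"
        counts[kind] += 1
    return counts


record = json.loads(RECORD_LINE)
counts = count_line_types(record["output"])
print(record["input"], dict(counts))
```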

+ 6 - 0
finetune/evals/queries.txt

@@ -36,6 +36,12 @@ connection timeout error
 memory leak debugging
 cors error fix
 
+# Temporal / recency queries (should expand with years, "recent", "latest")
+recent news about Shopify
+latest AI developments
+best laptops right now
+what changed in kubernetes latest version
+
 # Complex queries
 how to implement caching with redis in nodejs
 best practices for api rate limiting
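
Review note: the comment on the new query group says temporal queries should expand with years or words such as "recent" and "latest". One way to spot-check that is sketched below. The word list and the `has_recency_signal` helper are assumptions made for illustration; the diff does not show how eval.py scores these queries.

```python
import re

# Hypothetical pattern: a 20xx year or a common recency word.
RECENCY_PATTERN = re.compile(
    r"\b(?:20\d{2}|recent|latest|current|today|now|upcoming)\b",
    re.IGNORECASE,
)


def has_recency_signal(line: str) -> bool:
    """True if the line contains a year or a recency keyword."""
    return RECENCY_PATTERN.search(line) is not None


# lex lines from the "recent news about Shopify" training example in this commit.
shopify_lex = [
    "shopify corporate news 2025 2026",
    "shopify product updates recent",
    "shopify earnings announcements latest",
]

for line in shopify_lex:
    print(line, "->", has_recency_signal(line))
```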

+ 389 - 0
finetune/jobs/grpo.py

@@ -0,0 +1,389 @@
+# /// script
+# requires-python = ">=3.10"
+# dependencies = [
+#     "trl>=0.12.0",
+#     "peft>=0.7.0",
+#     "transformers>=4.45.0",
+#     "accelerate>=0.24.0",
+#     "huggingface_hub>=0.20.0",
+#     "datasets",
+#     "bitsandbytes",
+#     "torch",
+# ]
+# ///
+"""
+GRPO training for QMD query expansion (Qwen3-1.7B).
+
+Runs on top of merged SFT weights. Self-contained for HuggingFace Jobs:
+    hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 4h jobs/grpo.py
+"""
+
+import os
+import re
+from collections import Counter
+
+import torch
+from datasets import load_dataset
+from huggingface_hub import login
+from peft import LoraConfig, PeftModel, get_peft_model
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from trl import GRPOTrainer, GRPOConfig
+
+# --- Config (inlined from configs/grpo.yaml) ---
+BASE_MODEL = "Qwen/Qwen3-1.7B"
+SFT_MODEL = "tobil/qmd-query-expansion-1.7B-sft"
+OUTPUT_MODEL = "tobil/qmd-query-expansion-1.7B-grpo"
+DATASET = "tobil/qmd-query-expansion-train-v2"
+
+# =============================================================================
+# Reward function (inlined from reward.py, which is the single source of truth)
+# =============================================================================
+
+STOPWORDS = frozenset({
+    'the', 'a', 'an', 'is', 'are', 'to', 'for', 'of', 'in',
+    'and', 'or', 'it', 'this', 'that', 'be', 'with', 'as', 'on', 'by',
+})
+
+KEY_TERM_STOPWORDS = frozenset({
+    'what', 'is', 'how', 'to', 'the', 'a', 'an', 'in', 'on', 'for', 'of',
+    'and', 'or', 'with', 'my', 'your', 'do', 'does', 'can', 'i', 'me', 'we',
+    'who', 'where', 'when', 'why', 'which', 'find', 'get', 'show', 'tell',
+})
+
+GENERIC_LEX_PHRASES = frozenset({
+    'find information about', 'search for', 'look up', 'get information',
+    'learn about', 'information on', 'details about', 'find out about',
+    'what is', 'how to', 'guide to', 'help with',
+})
+
+CHAT_TEMPLATE_TOKENS = frozenset({
+    '<|im_start|>', '<|im_end|>', '<|endoftext|>',
+    '\nassistant\n', '\nuser\n',
+})
+
+
+def parse_expansion(text: str) -> dict:
+    result = {"lex": [], "vec": [], "hyde": [], "invalid": []}
+    for line in text.strip().split("\n"):
+        line = line.strip()
+        if not line:
+            continue
+        if line.startswith("lex:"):
+            result["lex"].append(line[4:].strip())
+        elif line.startswith("vec:"):
+            result["vec"].append(line[4:].strip())
+        elif line.startswith("hyde:"):
+            result["hyde"].append(line[5:].strip())
+        else:
+            result["invalid"].append(line)
+    return result
+
+
+def clean_model_output(text: str) -> tuple[str, bool]:
+    text = text.replace('<|im_end|>', '').strip()
+    used_thinking = '<think>' in text and '</think>' in text
+    if used_thinking:
+        text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()
+    return text, used_thinking
+
+
+def extract_named_entities(query: str) -> set:
+    entities = set()
+    words = query.split()
+    prev_was_entity = False
+    for i, word in enumerate(words):
+        clean = word.strip('.,!?:;()[]"\'')
+        if not clean:
+            prev_was_entity = False
+            continue
+        is_entity = False
+        if clean.isupper() and len(clean) >= 2:
+            entities.add(clean.lower())
+            is_entity = True
+        elif i > 0 and clean[0].isupper() and clean.lower() not in KEY_TERM_STOPWORDS:
+            entities.add(clean.lower())
+            is_entity = True
+        elif any(c in clean for c in '.+-#@') and len(clean) >= 2:
+            entities.add(clean.lower())
+            is_entity = True
+        elif len(clean) > 1 and any(c.isupper() for c in clean[1:]) and clean[0].isupper():
+            entities.add(clean.lower())
+            is_entity = True
+        elif prev_was_entity and clean.lower() not in KEY_TERM_STOPWORDS:
+            entities.add(clean.lower())
+            is_entity = True
+        prev_was_entity = is_entity
+    return entities
+
+
+def get_key_terms(query: str) -> set:
+    return set(query.lower().split()) - KEY_TERM_STOPWORDS
+
+
+def lex_preserves_key_terms(lex_line: str, query: str) -> bool:
+    key_terms = get_key_terms(query)
+    if not key_terms:
+        return True
+    return bool(key_terms & set(lex_line.lower().split()))
+
+
+def lex_preserves_entities(line: str, entities: set) -> bool:
+    if not entities:
+        return True
+    lower = line.lower()
+    return any(e in lower for e in entities)
+
+
+def lex_is_generic(lex_line: str) -> bool:
+    lower = lex_line.lower().strip()
+    for phrase in GENERIC_LEX_PHRASES:
+        if phrase in lower or lower.startswith(phrase.split()[0]):
+            remaining = lower
+            for word in phrase.split():
+                remaining = remaining.replace(word, '', 1).strip()
+            if len(remaining) < 3:
+                return True
+    return False
+
+
+def word_set_distance(a: str, b: str) -> int:
+    return len(set(a.lower().split()) ^ set(b.lower().split()))
+
+
+def is_diverse(a: str, b: str, min_distance: int = 2) -> bool:
+    a, b = a.lower().strip(), b.lower().strip()
+    if a == b or a in b or b in a:
+        return False
+    return word_set_distance(a, b) >= min_distance
+
+
+def echoes_query(expansion: str, query: str) -> bool:
+    exp, q = expansion.lower().strip(), query.lower().strip()
+    return exp == q or (q in exp and len(exp) < len(q) + 10)
+
+
+def word_repetition_penalty(text: str) -> int:
+    counts = Counter(re.findall(r'\b\w+\b', text.lower()))
+    return sum((c - 2) * 2 for w, c in counts.items()
+               if c >= 3 and w not in STOPWORDS and len(w) > 2)
+
+
+def score_expansion(query: str, expansion: str) -> float:
+    """Score expansion as float in [0.0, 1.0] for RL reward."""
+    text, used_thinking = clean_model_output(expansion.strip())
+
+    # Hard fail: chat template leakage
+    if any(tok in text for tok in CHAT_TEMPLATE_TOKENS):
+        return 0.0
+
+    # Hard fail: invalid lines
+    for line in text.split("\n"):
+        line = line.strip()
+        if line and not line.startswith(("lex:", "vec:", "hyde:")):
+            return 0.0
+
+    parsed = parse_expansion(text)
+
+    # Format (10-30; invalid lines already hard-failed above)
+    format_score = 10  # no invalid lines
+    if parsed["lex"]:
+        format_score += 10
+    if parsed["vec"]:
+        format_score += 10
+
+    # Diversity (0-30)
+    diversity_score = 0
+    types_present = sum(1 for t in ("lex", "vec") if parsed[t])
+    if types_present >= 2:
+        diversity_score += 10
+    if len(parsed["lex"]) + len(parsed["vec"]) >= 2:
+        diversity_score += 5
+    lex_div = 5
+    for i, a in enumerate(parsed["lex"]):
+        for b in parsed["lex"][i+1:]:
+            if not is_diverse(a, b, 2):
+                lex_div -= 2
+    diversity_score += max(0, lex_div)
+    vec_div = 5
+    for i, a in enumerate(parsed["vec"]):
+        for b in parsed["vec"][i+1:]:
+            if not is_diverse(a, b, 3):
+                vec_div -= 2
+    diversity_score += max(0, vec_div)
+    echo = 5
+    for exp in parsed["lex"] + parsed["vec"]:
+        if echoes_query(exp, query):
+            echo -= 3
+    diversity_score += max(0, echo)
+
+    # HyDE (0-20)
+    hyde_score = 0
+    if parsed["hyde"]:
+        hyde_text = parsed["hyde"][0]
+        hyde_score += 5
+        hyde_len = len(hyde_text)
+        if 50 <= hyde_len <= 200:
+            hyde_score += 5
+        elif hyde_len < 50:
+            hyde_score += 2
+        if "\n" not in hyde_text:
+            hyde_score += 5
+        hyde_score += max(0, 5 - word_repetition_penalty(hyde_text))
+
+    # Quality (5-20)
+    quality_score = 5
+    if parsed["lex"] and parsed["vec"]:
+        avg_lex = sum(len(l) for l in parsed["lex"]) / len(parsed["lex"])
+        avg_vec = sum(len(v) for v in parsed["vec"]) / len(parsed["vec"])
+        if avg_lex <= avg_vec:
+            quality_score += 5
+    if parsed["vec"]:
+        natural = sum(1 for v in parsed["vec"] if " " in v and len(v) > 15)
+        quality_score += 5 if natural == len(parsed["vec"]) else 2
+    if parsed["lex"]:
+        with_terms = sum(1 for l in parsed["lex"] if lex_preserves_key_terms(l, query))
+        if with_terms == len(parsed["lex"]):
+            quality_score += 5
+        elif with_terms > 0:
+            quality_score += 2
+
+    # Entity (max +20; -30 if no lex line keeps an entity, -15 per generic lex line)
+    entity_score = 0
+    entities = extract_named_entities(query)
+    if entities and parsed["lex"]:
+        with_entities = sum(1 for l in parsed["lex"] if lex_preserves_entities(l, entities))
+        if with_entities == len(parsed["lex"]):
+            entity_score += 15
+        elif with_entities > 0:
+            entity_score += 5
+        else:
+            entity_score -= 30
+        generic_count = sum(1 for l in parsed["lex"] if lex_is_generic(l))
+        if generic_count:
+            entity_score -= generic_count * 15
+        if parsed["vec"]:
+            vec_with = sum(1 for v in parsed["vec"] if lex_preserves_entities(v, entities))
+            if vec_with > 0:
+                entity_score += 5
+    elif not entities:
+        entity_score = 10
+
+    # Think bonus (0-20)
+    think_bonus = 0 if used_thinking else 20
+
+    total = format_score + diversity_score + hyde_score + quality_score + entity_score + think_bonus
+    max_possible = 140 if parsed["hyde"] else 120
+    return max(0.0, min(1.0, total / max_possible))
+
+
+def extract_query_from_prompt(prompt: str) -> str:
+    if "Expand this search query:" in prompt:
+        query = prompt.split("Expand this search query:")[-1].strip()
+        if "<|im_end|>" in query:
+            query = query.split("<|im_end|>")[0].strip()
+        return query
+    return prompt.strip()
+
+
+class QMDRewardFunction:
+    __name__ = "qmd_scoring_reward"
+
+    def __call__(self, completions: list[str], prompts: list[str] | None = None, **kwargs) -> list[float]:
+        rewards = []
+        for i, completion in enumerate(completions):
+            query = ""
+            if prompts and i < len(prompts):
+                query = extract_query_from_prompt(prompts[i])
+            rewards.append(score_expansion(query, completion))
+        return rewards
+
+
+# =============================================================================
+# Main training
+# =============================================================================
+
+def main():
+    hf_token = os.environ.get("HF_TOKEN")
+    if hf_token:
+        login(token=hf_token)
+
+    print(f"Loading tokenizer from {BASE_MODEL}...")
+    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    # Load and format dataset
+    print(f"Loading dataset: {DATASET}...")
+    dataset = load_dataset(DATASET, split="train")
+
+    def extract_prompt(example):
+        content = example["messages"][0]["content"]
+        messages = [{"role": "user", "content": content}]
+        formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+        return {"prompt": formatted}
+
+    dataset = dataset.map(extract_prompt, remove_columns=dataset.column_names)
+    dataset = dataset.shuffle(seed=42).select(range(min(1000, len(dataset))))
+    print(f"Using {len(dataset)} prompts for GRPO")
+
+    # Load base model, merge SFT adapter
+    print(f"Loading base model {BASE_MODEL}...")
+    base_model = AutoModelForCausalLM.from_pretrained(
+        BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto",
+    )
+    print(f"Merging SFT adapter {SFT_MODEL}...")
+    model = PeftModel.from_pretrained(base_model, SFT_MODEL)
+    model = model.merge_and_unload()
+    print("SFT adapter merged.")
+
+    # Fresh LoRA for GRPO (small: rank 4, q/v only)
+    grpo_lora = LoraConfig(
+        r=4, lora_alpha=8, lora_dropout=0.05,
+        bias="none", task_type="CAUSAL_LM",
+        target_modules=["q_proj", "v_proj"],
+    )
+    model = get_peft_model(model, grpo_lora)
+    model.print_trainable_parameters()
+
+    config = GRPOConfig(
+        output_dir="qmd-query-expansion-1.7B-grpo",
+        push_to_hub=True,
+        hub_model_id=OUTPUT_MODEL,
+
+        num_generations=4,
+        max_completion_length=200,
+        beta=0.04,  # KL regularization — prevents drift from SFT checkpoint
+
+        num_train_epochs=1,
+        per_device_train_batch_size=2,
+        gradient_accumulation_steps=8,
+        learning_rate=5e-7,
+        max_grad_norm=0.5,
+        max_steps=200,
+
+        logging_steps=10,
+        save_strategy="epoch",
+        bf16=True,
+
+        report_to="none",
+    )
+
+    print("Initializing GRPO trainer...")
+    trainer = GRPOTrainer(
+        model=model,
+        processing_class=tokenizer,
+        args=config,
+        train_dataset=dataset,
+        reward_funcs=[QMDRewardFunction()],
+    )
+
+    print("Starting GRPO training...")
+    trainer.train()
+
+    print("Pushing to Hub...")
+    trainer.push_to_hub()
+    print(f"Done! Model: https://huggingface.co/{OUTPUT_MODEL}")
+
+
+if __name__ == "__main__":
+    main()
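
Review note: the entity bonus and penalty in `score_expansion` depend on what `extract_named_entities` returns, so it helps to run it on a few of the new temporal queries. The sketch below copies `KEY_TERM_STOPWORDS` and the function from the listing above so it runs on its own; only the demo calls at the bottom are new. One behaviour to be aware of: the `i > 0` guard skips a capitalized first word, so "Shopify" is not picked up in "Shopify Q4 earnings", while the word following an entity is.

```python
KEY_TERM_STOPWORDS = frozenset({
    'what', 'is', 'how', 'to', 'the', 'a', 'an', 'in', 'on', 'for', 'of',
    'and', 'or', 'with', 'my', 'your', 'do', 'does', 'can', 'i', 'me', 'we',
    'who', 'where', 'when', 'why', 'which', 'find', 'get', 'show', 'tell',
})


def extract_named_entities(query: str) -> set:
    # Copied from jobs/grpo.py in this diff.
    entities = set()
    words = query.split()
    prev_was_entity = False
    for i, word in enumerate(words):
        clean = word.strip('.,!?:;()[]"\'')
        if not clean:
            prev_was_entity = False
            continue
        is_entity = False
        if clean.isupper() and len(clean) >= 2:
            entities.add(clean.lower())
            is_entity = True
        elif i > 0 and clean[0].isupper() and clean.lower() not in KEY_TERM_STOPWORDS:
            entities.add(clean.lower())
            is_entity = True
        elif any(c in clean for c in '.+-#@') and len(clean) >= 2:
            entities.add(clean.lower())
            is_entity = True
        elif len(clean) > 1 and any(c.isupper() for c in clean[1:]) and clean[0].isupper():
            entities.add(clean.lower())
            is_entity = True
        elif prev_was_entity and clean.lower() not in KEY_TERM_STOPWORDS:
            entities.add(clean.lower())
            is_entity = True
        prev_was_entity = is_entity
    return entities


# Demo on queries added in this commit.
for q in ("recent news about Shopify", "Shopify Q4 earnings",
          "latest AI developments", "what changed in kubernetes latest version"):
    print(repr(q), "->", sorted(extract_named_entities(q)))
```

An all-lowercase query such as the Kubernetes one yields no entities, so it takes the flat `entity_score = 10` branch.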

+ 98 - 0
finetune/jobs/sft.py

@@ -0,0 +1,98 @@
+# /// script
+# requires-python = ">=3.10"
+# dependencies = [
+#     "trl>=0.12.0",
+#     "peft>=0.7.0",
+#     "transformers>=4.45.0",
+#     "accelerate>=0.24.0",
+#     "huggingface_hub>=0.20.0",
+#     "datasets",
+#     "bitsandbytes",
+#     "torch",
+# ]
+# ///
+"""
+SFT training for QMD query expansion (Qwen3-1.7B).
+
+Self-contained script for HuggingFace Jobs:
+    hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py
+"""
+
+import os
+
+from datasets import load_dataset
+from huggingface_hub import login
+from peft import LoraConfig
+from trl import SFTTrainer, SFTConfig
+
+# --- Config (inlined from configs/sft.yaml) ---
+BASE_MODEL = "Qwen/Qwen3-1.7B"
+OUTPUT_MODEL = "tobil/qmd-query-expansion-1.7B-sft"
+DATASET = "tobil/qmd-query-expansion-train-v2"
+
+hf_token = os.environ.get("HF_TOKEN")
+if hf_token:
+    login(token=hf_token)
+
+# Load and split dataset
+print(f"Loading dataset: {DATASET}...")
+dataset = load_dataset(DATASET, split="train")
+print(f"Dataset loaded: {len(dataset)} examples")
+
+split = dataset.train_test_split(test_size=0.1, seed=42)
+train_dataset = split["train"]
+eval_dataset = split["test"]
+print(f"  Train: {len(train_dataset)}, Eval: {len(eval_dataset)}")
+
+# SFT config
+config = SFTConfig(
+    output_dir="qmd-query-expansion-1.7B-sft",
+    push_to_hub=True,
+    hub_model_id=OUTPUT_MODEL,
+    hub_strategy="every_save",
+
+    num_train_epochs=3,
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=4,
+    learning_rate=2e-4,
+    max_length=512,
+
+    logging_steps=10,
+    save_strategy="steps",
+    save_steps=200,
+    save_total_limit=2,
+    eval_strategy="steps",
+    eval_steps=200,
+
+    warmup_ratio=0.03,
+    lr_scheduler_type="cosine",
+    bf16=True,
+
+    report_to="none",
+)
+
+# LoRA: rank 16, all projection layers
+peft_config = LoraConfig(
+    r=16,
+    lora_alpha=32,
+    lora_dropout=0.0,
+    bias="none",
+    task_type="CAUSAL_LM",
+    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
+)
+
+print("Initializing SFT trainer...")
+trainer = SFTTrainer(
+    model=BASE_MODEL,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset,
+    args=config,
+    peft_config=peft_config,
+)
+
+print("Starting SFT training...")
+trainer.train()
+
+print("Pushing to Hub...")
+trainer.push_to_hub()
+print(f"Done! Model: https://huggingface.co/{OUTPUT_MODEL}")
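
Review note: the SFT config combines `warmup_ratio=0.03` with `lr_scheduler_type="cosine"` and a peak learning rate of 2e-4. The sketch below shows the usual shape of such a schedule (linear warmup, then cosine decay to zero) in plain Python. It illustrates the formula under stated assumptions and is not the trainer's own code: `lr_at_step` is a hypothetical helper and the 1,000-step total is made up for the demo.

```python
import math


def lr_at_step(step: int, total_steps: int,
               base_lr: float = 2e-4, warmup_ratio: float = 0.03) -> float:
    """Linear warmup followed by cosine decay to zero (illustrative only)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL = 1000  # hypothetical total number of optimizer steps
for s in (0, 15, 30, 500, 1000):
    print(s, f"{lr_at_step(s, TOTAL):.2e}")
```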