Document query grammar and add skill helpers

Tobi Lutke, 3 months ago
parent
commit 64ef25e1f6
7 files changed, 308 additions and 204 deletions
  1. CHANGELOG.md (+3 -4)
  2. docs/SYNTAX.md (+14 -9)
  3. skills/qmd/SKILL.md (+6 -5)
  4. src/mcp.ts (+6 -9)
  5. src/qmd.ts (+132 -73)
  6. src/store.ts (+27 -25)
  7. test/structured-search.test.ts (+120 -79)

+ 3 - 4
CHANGELOG.md

@@ -4,15 +4,15 @@
 
 ## [1.1.0] - 2026-02-20
 
-QMD now speaks in **query documents** — structured multi-line queries where each line is typed (`lex:`, `vec:`, `hyde:`, `expand:`), combining keyword precision with semantic recall. A single plain query still works exactly as before. Lex now supports quoted phrases and negation (`"C++ performance" -sports -athlete`), making intent-aware disambiguation practical. The formal query grammar is documented in `docs/SYNTAX.md`.
+QMD now speaks in **query documents** — structured multi-line queries where every line is typed (`lex:`, `vec:`, `hyde:`), combining keyword precision with semantic recall. A single plain query still works exactly as before (it's treated as an implicit `expand:` and auto-expanded by the LLM). Lex now supports quoted phrases and negation (`"C++ performance" -sports -athlete`), making intent-aware disambiguation practical. The formal query grammar is documented in `docs/SYNTAX.md`.
 
 The npm package now uses the standard `#!/usr/bin/env node` bin convention, replacing the custom bash wrapper. This fixes native module ABI mismatches when installed via bun and works on any platform with node >= 22 on PATH.
 
 ### Changes
 
-- **Query document format**: multi-line queries with typed sub-queries (`lex:`, `vec:`, `hyde:`, `expand:`). Plain queries remain the default (`expand:` implicit). First sub-query gets 2× fusion weight — put your strongest signal first. Formal grammar in `docs/SYNTAX.md`.
+- **Query document format**: multi-line queries with typed sub-queries (`lex:`, `vec:`, `hyde:`). Plain queries remain the default (`expand:` implicit, but not written inside the document). First sub-query gets 2× fusion weight — put your strongest signal first. Formal grammar in `docs/SYNTAX.md`.
 - **Lex syntax**: full BM25 operator support. `"exact phrase"` for verbatim matching; `-term` and `-"phrase"` for exclusions. Essential for disambiguation when a term is overloaded across domains (e.g. `performance -sports -athlete`).
-- **`expand:` type**: explicit auto-expansion via local LLM. Max one per query document. Identical to the prior default behavior for plain queries.
+- **`expand:` shortcut**: send a single plain query (or start the document with `expand:` on its only line) to auto-expand via the local LLM. Query documents themselves are limited to `lex`, `vec`, and `hyde` lines.
 - **MCP `query` tool** (renamed from `structured_search`): rewrote the tool description to fully teach AI agents the query document format, lex syntax, and combination strategy. Includes worked examples with intent-aware lex.
 - **HTTP `/query` endpoint** (renamed from `/search`; `/search` kept as silent alias).
 - **`collections` array filter**: filter by multiple collections in a single query (`collections: ["notes", "brain"]`). Removed the single `collection` string param — array only.
@@ -362,4 +362,3 @@ notes, journals, and meeting transcripts.
 [Unreleased]: https://github.com/tobi/qmd/compare/v1.0.0...HEAD
 [1.0.0]: https://github.com/tobi/qmd/releases/tag/v1.0.0
 [0.9.0]: https://github.com/tobi/qmd/compare/v0.8.0...v0.9.0
-

+ 14 - 9
docs/SYNTAX.md

@@ -5,9 +5,12 @@ QMD queries are structured documents with typed sub-queries. Each line specifies
 ## Grammar
 
 ```ebnf
-query_document = { line } ;
-line           = [ type ":" ] text newline ;
-type           = "lex" | "vec" | "hyde" | "expand" ;
+query          = expand_query | query_document ;
+expand_query   = text | explicit_expand ;
+explicit_expand= "expand:" text ;
+query_document = { typed_line } ;
+typed_line     = type ":" text newline ;
+type           = "lex" | "vec" | "hyde" ;
 text           = quoted_phrase | plain_text ;
 quoted_phrase  = '"' { character } '"' ;
 plain_text     = { character } ;
@@ -21,14 +24,13 @@ newline        = "\n" ;
 | `lex` | BM25 | Keyword search with exact matching |
 | `vec` | Vector | Semantic similarity search |
 | `hyde` | Vector | Hypothetical document embedding |
-| `expand` | LLM | Auto-expand into lex/vec/hyde via local model |
 
 ## Default Behavior
 
-A query without any type prefix is treated as `expand:` — it gets passed to the query expansion model which generates lex, vec, and hyde variations automatically.
+A QMD query is either a single expand query or a multi-line query document. Any single-line query with no prefix is treated as an expand query and passed to the expansion model, which emits lex, vec, and hyde variants automatically.
 
 ```
-# These are equivalent:
+# These are equivalent and cannot be combined with typed lines:
 how does authentication work
 expand: how does authentication work
 ```
@@ -89,17 +91,20 @@ hyde: The API implements rate limiting using a token bucket algorithm...
 
 ## Expand Queries
 
-Use `expand:` to leverage the local query expansion model. Limited to one per query document.
+An expand query stands alone; it's not mixed with typed lines. You can either rely on the default untyped form or add the explicit `expand:` prefix:
 
 ```
 expand: error handling best practices
+# equivalent
+error handling best practices
 ```
 
-This generates lex, vec, and hyde variations automatically. Useful when you don't know the exact terms.
+Both forms call the local query expansion model, which generates lex, vec, and hyde variations automatically.
 
 ## Constraints
 
-- Maximum one `expand:` query per document
+- Top-level query must be either a standalone expand query or a multi-line document
+- Query documents allow only `lex`, `vec`, and `hyde` typed lines (no `expand:` inside)
 - `lex` syntax (`-term`, `"phrase"`) only works in lex queries
 - Empty lines are ignored
 - Leading/trailing whitespace is trimmed
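
The updated grammar allows only `lex`, `vec`, and `hyde` lines inside a query document, each on a single line with non-empty text. As a minimal sketch of what a conforming document looks like, the hypothetical helper `toQueryDocument` below (not part of QMD; the name and the `SubQuery` shape are assumptions for illustration) serializes typed sub-queries into that form and rejects input the constraints above would disallow.

```typescript
// Hypothetical helper, not part of QMD: serializes typed sub-queries into
// a query document that follows the grammar in docs/SYNTAX.md.
type SubQueryType = "lex" | "vec" | "hyde";

interface SubQuery {
  type: SubQueryType;
  query: string;
}

function toQueryDocument(subQueries: SubQuery[]): string {
  return subQueries
    .map(({ type, query }) => {
      const text = query.trim();
      // Typed lines must carry text.
      if (!text) {
        throw new Error(`${type}: sub-query must include text`);
      }
      // Each sub-query has to fit on one line.
      if (/[\r\n]/.test(text)) {
        throw new Error(`${type}: sub-query must be a single line`);
      }
      return `${type}: ${text}`;
    })
    .join("\n");
}

const doc = toQueryDocument([
  { type: "lex", query: '"C++ performance" -sports -athlete' },
  { type: "vec", query: "how to make C++ code faster" },
]);
console.log(doc);
```

The resulting string can be passed to `qmd query` as a single argument. A standalone expand query needs no helper, since it is just the plain text, optionally prefixed with `expand:`.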

+ 6 - 5
skills/qmd/SKILL.md

@@ -37,7 +37,6 @@ Local search engine for markdown content.
 | `lex` | BM25 | Keywords — exact terms, names, code |
 | `vec` | Vector | Question — natural language |
 | `hyde` | Vector | Answer — hypothetical result (50-100 words) |
-| `expand` | LLM | Auto-expand via local model (max 1 per query) |
 
 ### Writing Good Queries
 
@@ -57,16 +56,16 @@ Local search engine for markdown content.
 - Use the vocabulary you expect in the result
 
 **expand (auto-expand)**
-- Let the local LLM generate lex/vec/hyde variations
-- Good when you don't know exact terms
-- Max one expand: per query
+- Use a single-line query (implicit) or `expand: question` on its own line
+- Lets the local LLM generate lex/vec/hyde variations
+- Do not mix `expand:` with other typed lines — it's either a standalone expand query or a full query document
 
 ### Combining Types
 
 | Goal | Approach |
 |------|----------|
 | Know exact terms | `lex` only |
-| Don't know vocabulary | `vec` or `expand` |
+| Don't know vocabulary | Use a single-line query (implicit `expand:`) or `vec` |
 | Best recall | `lex` + `vec` |
 | Complex topic | `lex` + `vec` + `hyde` |
 
@@ -107,6 +106,8 @@ qmd query $'lex: X\nvec: Y'       # Structured
 qmd query $'expand: question'     # Explicit expand
 qmd search "keywords"             # BM25 only (no LLM)
 qmd get "#abc123"                 # By docid
+qmd multi-get "journals/2026-*.md" -l 40  # Batch pull snippets by glob
+qmd multi-get notes/foo.md,notes/bar.md   # Comma-separated list, preserves order
 ```
 
 ## HTTP API

+ 6 - 9
src/mcp.ts

@@ -120,7 +120,7 @@ function buildInstructions(store: Store): string {
 
   // --- Search tool ---
   lines.push("");
-  lines.push("Search: Use `query` with sub-queries (lex/vec/hyde/expand):");
+  lines.push("Search: Use `query` with sub-queries (lex/vec/hyde):");
   lines.push("  - type:'lex' — BM25 keyword search (exact terms, fast)");
   lines.push("  - type:'vec' — semantic vector search (meaning-based)");
   lines.push("  - type:'hyde' — hypothetical document (write what the answer looks like)");
@@ -229,10 +229,9 @@ function createMcpServer(store: Store): McpServer {
   // ---------------------------------------------------------------------------
 
   const subSearchSchema = z.object({
-    type: z.enum(['lex', 'vec', 'hyde', 'expand']).describe(
-      "lex = BM25 keywords (supports \"phrase\" and -negation), " +
-      "vec = semantic question, hyde = hypothetical answer passage, " +
-      "expand = auto-expand via LLM (max 1 per query)"
+    type: z.enum(['lex', 'vec', 'hyde']).describe(
+      "lex = BM25 keywords (supports \"phrase\" and -negation); " +
+      "vec = semantic question; hyde = hypothetical answer passage"
     ),
     query: z.string().describe(
       "The query text. For lex: use keywords, \"quoted phrases\", and -negation. " +
@@ -266,8 +265,6 @@ Good lex examples:
 **hyde** — Hypothetical document. Write 50-100 words that look like the answer. Often the most powerful for nuanced topics.
 - \`The rate limiter uses a token bucket algorithm. When a client exceeds 100 req/min, subsequent requests return 429 until the window resets.\`
 
-**expand** — Auto-expand via local LLM. Generates lex+vec+hyde variations automatically. Max one per query. Useful when you don't know the exact terms.
-
 ## Strategy
 
 Combine types for best results. First sub-query gets 2× weight — put your strongest signal first.
@@ -278,7 +275,7 @@ Combine types for best results. First sub-query gets 2× weight — put your str
 | Concept search | \`vec\` only |
 | Best recall | \`lex\` + \`vec\` |
 | Complex/nuanced | \`lex\` + \`vec\` + \`hyde\` |
-| Unknown vocabulary | \`expand\` |
+| Unknown vocabulary | Use a standalone natural-language query (no typed lines) so the server can auto-expand it |
 
 ## Examples
 
@@ -306,7 +303,7 @@ Intent-aware lex (C++ performance, not sports):
       annotations: { readOnlyHint: true, openWorldHint: false },
       inputSchema: {
         searches: z.array(subSearchSchema).min(1).max(10).describe(
-          "Sub-queries to execute. First gets 2x weight. Max one expand: per query."
+          "Typed sub-queries to execute (lex/vec/hyde). First gets 2x weight."
         ),
         limit: z.number().optional().default(10).describe("Max results (default: 10)"),
         minScore: z.number().optional().default(0).describe("Min relevance 0-1 (default: 0)"),
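
The schema description says the first sub-query gets 2x weight, but this diff does not show how the ranked lists are fused. The sketch below is an illustration only: it assumes reciprocal rank fusion with a constant `k = 60`, and the function name `fuseRankedLists` is hypothetical. It is meant to show why ordering sub-queries matters, not how QMD computes scores.

```typescript
// Illustrative only: weighted reciprocal rank fusion (RRF).
// QMD's actual fusion code is not part of this diff; k = 60 and the
// name fuseRankedLists are assumptions made for this example.
function fuseRankedLists(lists: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  lists.forEach((list, listIndex) => {
    // The first sub-query's list counts double.
    const weight = listIndex === 0 ? 2 : 1;
    list.forEach((docId, rank) => {
      // rank is 0-based, so rank + 1 is the position in the list.
      const contribution = weight / (k + rank + 1);
      scores.set(docId, (scores.get(docId) ?? 0) + contribution);
    });
  });
  return scores;
}

const fused = fuseRankedLists([
  ["a.md", "b.md"], // results of the first (strongest) sub-query
  ["b.md", "a.md"], // results of a second sub-query
]);
const ranked = Array.from(fused.entries())
  .sort((x, y) => y[1] - x[1])
  .map(([docId]) => docId);
console.log(ranked);
```

Under these assumptions the two lists disagree, and the document favored by the first list wins the tie-break, which is the practical meaning of "put your strongest signal first".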

+ 132 - 73
src/qmd.ts

@@ -1950,46 +1950,53 @@ function filterByCollections<T extends { filepath?: string; file?: string }>(res
  *   "CAP\nconsistency"               -> throws (multiple plain lines)
  */
 function parseStructuredQuery(query: string): StructuredSubSearch[] | null {
-  const lines = query.split('\n').map(l => l.trim()).filter(l => l.length > 0);
-  if (lines.length === 0) return null;
-
-  const prefixRe = /^(lex|vec|hyde|expand):\s*/i;
-  const searches: StructuredSubSearch[] = [];
-  const plainLines: string[] = [];
+  const rawLines = query.split('\n').map((line, idx) => ({
+    raw: line,
+    trimmed: line.trim(),
+    number: idx + 1,
+  })).filter(line => line.trimmed.length > 0);
+
+  if (rawLines.length === 0) return null;
+
+  const prefixRe = /^(lex|vec|hyde):\s*/i;
+  const expandRe = /^expand:\s*/i;
+  const typed: StructuredSubSearch[] = [];
+
+  for (const line of rawLines) {
+    if (expandRe.test(line.trimmed)) {
+      if (rawLines.length > 1) {
+        throw new Error(`Line ${line.number} starts with expand:, but query documents cannot mix expand with typed lines. Submit a single expand query instead.`);
+      }
+      const text = line.trimmed.replace(expandRe, '').trim();
+      if (!text) {
+        throw new Error('expand: query must include text.');
+      }
+      return null; // treat as standalone expand query
+    }
 
-  for (const line of lines) {
-    const match = line.match(prefixRe);
+    const match = line.trimmed.match(prefixRe);
     if (match) {
-      const type = match[1]!.toLowerCase() as 'lex' | 'vec' | 'hyde' | 'expand';
-      const text = line.slice(match[0].length).trim();
-      if (text.length > 0) {
-        searches.push({ type, query: text });
+      const type = match[1]!.toLowerCase() as 'lex' | 'vec' | 'hyde';
+      const text = line.trimmed.slice(match[0].length).trim();
+      if (!text) {
+        throw new Error(`Line ${line.number} (${type}:) must include text.`);
       }
-    } else {
-      plainLines.push(line);
+      if (/\r|\n/.test(text)) {
+        throw new Error(`Line ${line.number} (${type}:) contains a newline. Keep each query on a single line.`);
+      }
+      typed.push({ type, query: text, line: line.number });
+      continue;
     }
-  }
-
-  // All plain lines, no prefixes -> null (use normal expansion)
-  if (searches.length === 0 && plainLines.length === 1) {
-    return null;
-  }
 
-  // Multiple plain lines without prefixes -> ambiguous, error
-  if (plainLines.length > 1) {
-    throw new Error(
-      `Ambiguous query: multiple lines without lex:/vec:/hyde: prefix.\n` +
-      `Either use a single line (for query expansion) or prefix each line.\n` +
-      `Example:\n  lex: keyword terms\n  vec: natural language question\n  hyde: hypothetical answer passage`
-    );
-  }
+    if (rawLines.length === 1) {
+      // Single plain line -> implicit expand
+      return null;
+    }
 
-  // Mix of prefixed and one plain line -> treat plain as lex
-  if (plainLines.length === 1) {
-    searches.unshift({ type: 'lex', query: plainLines[0]! });
+    throw new Error(`Line ${line.number} is missing a lex:/vec:/hyde: prefix. Each line in a query document must start with one.`);
   }
 
-  return searches.length > 0 ? searches : null;
+  return typed.length > 0 ? typed : null;
 }
 
 function search(query: string, opts: OutputOptions): void {
@@ -2239,6 +2246,7 @@ function parseCLI() {
       },
       help: { type: "boolean", short: "h" },
       version: { type: "boolean", short: "v" },
+      skill: { type: "boolean" },
       // Search options
       n: { type: "string" },
       "min-score": { type: "string" },
@@ -2311,58 +2319,104 @@ function parseCLI() {
   };
 }
 
+function showSkill(): void {
+  const scriptDir = dirname(fileURLToPath(import.meta.url));
+  const relativePath = pathJoin("skills", "qmd", "SKILL.md");
+  const skillPath = pathJoin(scriptDir, "..", relativePath);
+
+  console.log(`QMD Skill (${relativePath})`);
+  console.log(`Location: ${skillPath}`);
+  console.log("");
+
+  if (!existsSync(skillPath)) {
+    console.error("SKILL.md not found. If you built from source, ensure skills/qmd/SKILL.md exists.");
+    return;
+  }
+
+  const content = readFileSync(skillPath, "utf-8");
+  process.stdout.write(content.endsWith("\n") ? content : content + "\n");
+}
+
 function showHelp(): void {
+  console.log("qmd — Quick Markdown Search");
+  console.log("");
   console.log("Usage:");
-  console.log("  qmd collection add [path] --name <name> --mask <pattern>  - Create/index collection");
-  console.log("  qmd collection list           - List all collections with details");
-  console.log("  qmd collection remove <name>  - Remove a collection by name");
-  console.log("  qmd collection rename <old> <new>  - Rename a collection");
-  console.log("  qmd ls [collection[/path]]    - List collections or files in a collection");
-  console.log("  qmd context add [path] \"text\" - Add context for path (defaults to current dir)");
-  console.log("  qmd context list              - List all contexts");
-  console.log("  qmd context rm <path>         - Remove context");
-  console.log("  qmd get <file>[:line] [-l N] [--from N]  - Get document (optionally from line, max N lines)");
-  console.log("  qmd multi-get <pattern> [-l N] [--max-bytes N]  - Get multiple docs by glob or comma-separated list");
-  console.log("  qmd status                    - Show index status and collections");
-  console.log("  qmd update [--pull]           - Re-index all collections (--pull: git pull first)");
-  console.log("  qmd embed [-f]                - Create vector embeddings (900 tokens/chunk, 15% overlap)");
-  console.log("  qmd cleanup                   - Remove cache and orphaned data, vacuum DB");
-  console.log("  qmd query <query>             - Search with query expansion + reranking (recommended)");
-  console.log("  qmd query 'lex:..\\nvec:...'   - Structured search (you provide lex/vec/hyde queries)");
-  console.log("  qmd search <query>            - Full-text keyword search (BM25, no LLM)");
-  console.log("  qmd vsearch <query>           - Vector similarity search (no reranking)");
-  console.log("  qmd mcp                       - Start MCP server (stdio transport)");
-  console.log("  qmd mcp --http [--port N]     - Start MCP server (HTTP transport, default port 8181)");
-  console.log("  qmd mcp --http --daemon       - Start MCP server as background daemon");
-  console.log("  qmd mcp stop                  - Stop background MCP daemon");
+  console.log("  qmd <command> [options]");
+  console.log("");
+  console.log("Primary commands:");
+  console.log("  qmd query <query>             - Hybrid search with auto expansion + reranking (recommended)");
+  console.log("  qmd query 'lex:..\\nvec:...'   - Structured query document (you provide lex/vec/hyde lines)");
+  console.log("  qmd search <query>            - Full-text BM25 keywords (no LLM)");
+  console.log("  qmd vsearch <query>           - Vector similarity only");
+  console.log("  qmd get <file>[:line] [-l N]  - Show a single document, optional line slice");
+  console.log("  qmd multi-get <pattern>       - Batch fetch via glob or comma-separated list");
+  console.log("  qmd mcp                       - Start the MCP server (stdio transport for AI agents)");
+  console.log("");
+  console.log("Collections & context:");
+  console.log("  qmd collection add/list/remove/rename/show   - Manage indexed folders");
+  console.log("  qmd context add/list/rm                      - Attach human-written summaries");
+  console.log("  qmd ls [collection[/path]]                   - Inspect indexed files");
+  console.log("");
+  console.log("Maintenance:");
+  console.log("  qmd status                    - View index + collection health");
+  console.log("  qmd update [--pull]           - Re-index collections (optionally git pull first)");
+  console.log("  qmd embed [-f]                - Generate/refresh vector embeddings");
+  console.log("  qmd cleanup                   - Clear caches, vacuum DB");
+  console.log("");
+  console.log("Query syntax (qmd query):");
+  console.log("  QMD queries are either a single expand query (no prefix) or a multi-line");
+  console.log("  document where every line is typed with lex:, vec:, or hyde:. This grammar");
+  console.log("  matches the docs in docs/SYNTAX.md and is enforced in the CLI.");
+  console.log("");
+  const grammar = [
+    `query          = expand_query | query_document ;`,
+    `expand_query   = text | explicit_expand ;`,
+    `explicit_expand= "expand:" text ;`,
+    `query_document = { typed_line } ;`,
+    `typed_line     = type ":" text newline ;`,
+    `type           = "lex" | "vec" | "hyde" ;`,
+    `text           = quoted_phrase | plain_text ;`,
+    `quoted_phrase  = '"' { character } '"' ;`,
+    `plain_text     = { character } ;`,
+    `newline        = "\\n" ;`,
+  ];
+  console.log("  Grammar:");
+  for (const line of grammar) {
+    console.log(`    ${line}`);
+  }
+  console.log("");
+  console.log("  Examples:");
+  console.log("    qmd query \"how does auth work\"                # single-line → implicit expand");
+  console.log("    qmd query $'lex: CAP theorem\\nvec: consistency'  # typed query document");
+  console.log("    qmd query $'lex: \"exact matches\" sports -baseball'  # phrase + negation lex search");
+  console.log("    qmd query $'hyde: Hypothetical answer text'       # hyde-only document");
+  console.log("");
+  console.log("  Constraints:");
+  console.log("    - Standalone expand queries cannot mix with typed lines.");
+  console.log("    - Query documents allow only lex:, vec:, or hyde: prefixes.");
+  console.log("    - Each typed line must be single-line text with balanced quotes.");
+  console.log("");
+  console.log("AI agents & integrations:");
+  console.log("  - Run `qmd mcp` to expose the MCP server (stdio) to agents/IDEs.");
+  console.log("  - `qmd --skill` prints the packaged skills/qmd/SKILL.md (path + contents).");
+  console.log("  - Advanced: `qmd mcp --http ...` and `qmd mcp --http --daemon` are optional for custom transports.");
   console.log("");
   console.log("Global options:");
-  console.log("  --index <name>             - Use custom index name (default: index)");
+  console.log("  --index <name>             - Use a named index (default: index)");
   console.log("");
   console.log("Search options:");
-  console.log("  -n <num>                   - Number of results (default: 5, or 20 for --files)");
-  console.log("  --all                      - Return all matches (use with --min-score to filter)");
+  console.log("  -n <num>                   - Max results (default 5, or 20 for --files/--json)");
+  console.log("  --all                      - Return all matches (pair with --min-score)");
   console.log("  --min-score <num>          - Minimum similarity score");
   console.log("  --full                     - Output full document instead of snippet");
-  console.log("  --line-numbers             - Add line numbers to output");
-  console.log("  --files                    - Output docid,score,filepath,context (default: 20 results)");
-  console.log("  --json                     - JSON output with snippets (default: 20 results)");
-  console.log("  --csv                      - CSV output with snippets");
-  console.log("  --md                       - Markdown output");
-  console.log("  --xml                      - XML output");
-  console.log("  -c, --collection <name>    - Filter results to a specific collection");
-  console.log("");
-  console.log("Structured queries (qmd query):");
-  console.log("  Prefix lines with lex:, vec:, or hyde: to skip automatic expansion.");
-  console.log("  lex:  BM25 keyword search (exact terms)");
-  console.log("  vec:  Vector similarity (natural language question)");
-  console.log("  hyde: Vector similarity (hypothetical answer passage)");
-  console.log("  Example: qmd query $'lex: CAP theorem\\nvec: consistency vs availability tradeoff'");
+  console.log("  --line-numbers             - Include line numbers in output");
+  console.log("  --files | --json | --csv | --md | --xml  - Output format");
+  console.log("  -c, --collection <name>    - Filter by one or more collections");
   console.log("");
   console.log("Multi-get options:");
   console.log("  -l <num>                   - Maximum lines per file");
-  console.log("  --max-bytes <num>          - Skip files larger than N bytes (default: 10240)");
-  console.log("  --json/--csv/--md/--xml/--files - Output format (same as search)");
+  console.log("  --max-bytes <num>          - Skip files larger than N bytes (default 10240)");
+  console.log("  --json/--csv/--md/--xml/--files - Same formats as search");
   console.log("");
   console.log(`Index: ${getDbPath()}`);
 }
@@ -2398,6 +2452,11 @@ if (isMain) {
     process.exit(0);
   }
 
+  if (cli.values.skill) {
+    showSkill();
+    process.exit(0);
+  }
+
   if (!cli.command || cli.values.help) {
     showHelp();
     process.exit(cli.values.help ? 0 : 1);
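
After this change `parseStructuredQuery` returns `null` for a standalone expand query (plain or `expand:`-prefixed) and an array of typed sub-queries for a query document. The caller code is not part of this diff, so the following is only a sketch of how a caller might branch on that result; `planQuery` is a hypothetical name, and the interface is re-declared locally to keep the example self-contained.

```typescript
// Local copy of the shape exported from src/store.ts in this commit.
interface StructuredSubSearch {
  type: "lex" | "vec" | "hyde";
  query: string;
  line?: number;
}

// Hypothetical caller-side helper: describes which path a parsed query takes.
// Note that the parser also returns null for empty input, so a real caller
// would reject empty queries before reaching this point.
function planQuery(raw: string, parsed: StructuredSubSearch[] | null): string {
  if (parsed === null) {
    // Standalone expand query: strip the optional prefix and expand via the LLM.
    const text = raw.trim().replace(/^expand:\s*/i, "");
    return `expand via LLM: ${text}`;
  }
  // Query document: run each typed line as given, no automatic expansion.
  return parsed
    .map(s => `${s.type} (line ${s.line ?? "?"}): ${s.query}`)
    .join("; ");
}

const expandPlan = planQuery("expand: error handling", null);
const documentPlan = planQuery("lex: CAP theorem\nvec: consistency", [
  { type: "lex", query: "CAP theorem", line: 1 },
  { type: "vec", query: "consistency", line: 2 },
]);
console.log(expandPlan);
console.log(documentPlan);
```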

+ 27 - 25
src/store.ts

@@ -2082,6 +2082,17 @@ export function validateSemanticQuery(query: string): string | null {
   return null;
 }
 
+export function validateLexQuery(query: string): string | null {
+  if (/[\r\n]/.test(query)) {
+    return 'Lex queries must be a single line. Remove newline characters or split into separate lex: lines.';
+  }
+  const quoteCount = (query.match(/"/g) ?? []).length;
+  if (quoteCount % 2 === 1) {
+    return 'Lex query has an unmatched double quote ("). Add the closing quote or remove it.';
+  }
+  return null;
+}
+
 export function searchFTS(db: Database, query: string, limit: number = 20, collectionName?: string): SearchResult[] {
   const ftsQuery = buildFTS5Query(query);
   if (!ftsQuery) return [];
@@ -3164,10 +3175,12 @@ export async function vectorSearchQuery(
  * Matches the format used in QMD training data.
  */
 export interface StructuredSubSearch {
-  /** Search type: 'lex' for BM25, 'vec' for semantic, 'hyde' for hypothetical, 'expand' for LLM expansion */
-  type: 'lex' | 'vec' | 'hyde' | 'expand';
+  /** Search type: 'lex' for BM25, 'vec' for semantic, 'hyde' for hypothetical */
+  type: 'lex' | 'vec' | 'hyde';
   /** The search query text */
   query: string;
+  /** Optional line number for error reporting (CLI parser) */
+  line?: number;
 }
 
 export interface StructuredSearchOptions {
@@ -3212,36 +3225,25 @@ export async function structuredSearch(
 
   if (searches.length === 0) return [];
 
-  // Validate: max one expand query, semantic queries don't use lex syntax
-  const expandSearches = searches.filter(s => s.type === 'expand');
-  if (expandSearches.length > 1) {
-    throw new Error('Maximum one expand: query per document');
-  }
+  // Validate queries before executing
   for (const search of searches) {
-    if (search.type === 'vec' || search.type === 'hyde') {
+    const location = search.line ? `Line ${search.line}` : 'Structured search';
+    if (/[\r\n]/.test(search.query)) {
+      throw new Error(`${location} (${search.type}): queries must be single-line. Remove newline characters.`);
+    }
+    if (search.type === 'lex') {
+      const error = validateLexQuery(search.query);
+      if (error) {
+        throw new Error(`${location} (lex): ${error}`);
+      }
+    } else if (search.type === 'vec' || search.type === 'hyde') {
       const error = validateSemanticQuery(search.query);
       if (error) {
-        throw new Error(`Invalid ${search.type} query: ${error}`);
+        throw new Error(`${location} (${search.type}): ${error}`);
       }
     }
   }
 
-  // Process expand: queries by calling the query expansion model
-  let processedSearches = searches.filter(s => s.type !== 'expand');
-  if (expandSearches.length > 0) {
-    const expandQuery = expandSearches[0]!.query;
-    const expanded = await store.expandQuery(expandQuery);
-    // Add expanded queries (lex, vec, hyde from the model)
-    for (const exp of expanded) {
-      processedSearches.push({ type: exp.type as 'lex' | 'vec' | 'hyde', query: exp.text });
-    }
-    // Also add original as lex for strong signal matching
-    processedSearches.unshift({ type: 'lex', query: expandQuery });
-  }
-
-  // Use processed searches from here on
-  searches = processedSearches;
-
   const rankedLists: RankedResult[][] = [];
   const docidMap = new Map<string, string>(); // filepath -> docid
   const hasVectors = !!store.db.prepare(
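
The new `validateLexQuery` check in this file returns an error message for invalid input and `null` otherwise. The snippet below copies the function from the hunk above so it can run on its own, then exercises the three outcomes; the sample queries are made up for illustration.

```typescript
// Copied from the src/store.ts hunk in this commit so the example is standalone.
function validateLexQuery(query: string): string | null {
  if (/[\r\n]/.test(query)) {
    return 'Lex queries must be a single line. Remove newline characters or split into separate lex: lines.';
  }
  // An odd number of double quotes means one phrase was never closed.
  const quoteCount = (query.match(/"/g) ?? []).length;
  if (quoteCount % 2 === 1) {
    return 'Lex query has an unmatched double quote ("). Add the closing quote or remove it.';
  }
  return null;
}

// Balanced quotes plus negation: valid, so the result is null.
const ok = validateLexQuery('"C++ performance" -sports -athlete');

// One opening quote with no closing quote: rejected.
const unbalanced = validateLexQuery('"unclosed phrase');

// Embedded newline: rejected before the quote check runs.
const multiline = validateLexQuery("first\nsecond");

console.log(ok, unbalanced, multiline);
```

Because the quote check only counts characters, it catches an unclosed phrase but does not verify that quotes are placed sensibly.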

+ 120 - 79
test/structured-search.test.ts

@@ -17,6 +17,7 @@ import {
   createStore,
   structuredSearch,
   validateSemanticQuery,
+  validateLexQuery,
   type StructuredSubSearch,
   type Store,
 } from "../src/store.js";
@@ -26,47 +27,53 @@ import { disposeDefaultLlamaCpp } from "../src/llm.js";
 // parseStructuredQuery Tests (CLI Parser)
 // =============================================================================
 
-/**
- * Parse structured search query syntax.
- * This is a copy of the function from qmd.ts for isolated testing.
- */
 function parseStructuredQuery(query: string): StructuredSubSearch[] | null {
-  const lines = query.split('\n').map(l => l.trim()).filter(l => l.length > 0);
-  if (lines.length === 0) return null;
+  const rawLines = query.split('\n').map((line, idx) => ({
+    raw: line,
+    trimmed: line.trim(),
+    number: idx + 1,
+  })).filter(line => line.trimmed.length > 0);
+
+  if (rawLines.length === 0) return null;
 
   const prefixRe = /^(lex|vec|hyde):\s*/i;
-  const searches: StructuredSubSearch[] = [];
-  const plainLines: string[] = [];
+  const expandRe = /^expand:\s*/i;
+  const typed: StructuredSubSearch[] = [];
 
-  for (const line of lines) {
-    const match = line.match(prefixRe);
+  for (const line of rawLines) {
+    if (expandRe.test(line.trimmed)) {
+      if (rawLines.length > 1) {
+        throw new Error(`Line ${line.number} starts with expand:, but query documents cannot mix expand with typed lines. Submit a single expand query instead.`);
+      }
+      const text = line.trimmed.replace(expandRe, '').trim();
+      if (!text) {
+        throw new Error('expand: query must include text.');
+      }
+      return null;
+    }
+
+    const match = line.trimmed.match(prefixRe);
     if (match) {
       const type = match[1]!.toLowerCase() as 'lex' | 'vec' | 'hyde';
-      const text = line.slice(match[0].length).trim();
-      if (text.length > 0) {
-        searches.push({ type, query: text });
+      const text = line.trimmed.slice(match[0].length).trim();
+      if (!text) {
+        throw new Error(`Line ${line.number} (${type}:) must include text.`);
       }
-    } else {
-      plainLines.push(line);
+      if (/\r|\n/.test(text)) {
+        throw new Error(`Line ${line.number} (${type}:) contains a newline. Keep each query on a single line.`);
+      }
+      typed.push({ type, query: text, line: line.number });
+      continue;
     }
-  }
 
-  // All plain lines, no prefixes -> null (use normal expansion)
-  if (searches.length === 0 && plainLines.length === 1) {
-    return null;
-  }
-
-  // Multiple plain lines without prefixes -> ambiguous, error
-  if (plainLines.length > 1) {
-    throw new Error("Ambiguous query: multiple lines without lex:/vec:/hyde: prefix.");
-  }
+    if (rawLines.length === 1) {
+      return null;
+    }
 
-  // Mix of prefixed and one plain line -> treat plain as lex
-  if (plainLines.length === 1) {
-    searches.unshift({ type: 'lex', query: plainLines[0]! });
+    throw new Error(`Line ${line.number} is missing a lex:/vec:/hyde: prefix. Each line in a query document must start with one.`);
   }
 
-  return searches.length > 0 ? searches : null;
+  return typed.length > 0 ? typed : null;
 }
 
 describe("parseStructuredQuery", () => {
@@ -76,6 +83,10 @@ describe("parseStructuredQuery", () => {
       expect(parseStructuredQuery("distributed systems")).toBeNull();
     });
 
+    test("explicit expand line treated as plain query", () => {
+      expect(parseStructuredQuery("expand: error handling best practices")).toBeNull();
+    });
+
     test("empty queries", () => {
       expect(parseStructuredQuery("")).toBeNull();
       expect(parseStructuredQuery("   ")).toBeNull();
@@ -86,28 +97,28 @@ describe("parseStructuredQuery", () => {
   describe("single prefixed queries", () => {
     test("lex: prefix", () => {
       const result = parseStructuredQuery("lex: CAP theorem");
-      expect(result).toEqual([{ type: "lex", query: "CAP theorem" }]);
+      expect(result).toEqual([{ type: "lex", query: "CAP theorem", line: 1 }]);
     });
 
     test("vec: prefix", () => {
       const result = parseStructuredQuery("vec: what is the CAP theorem");
-      expect(result).toEqual([{ type: "vec", query: "what is the CAP theorem" }]);
+      expect(result).toEqual([{ type: "vec", query: "what is the CAP theorem", line: 1 }]);
     });
 
     test("hyde: prefix", () => {
       const result = parseStructuredQuery("hyde: The CAP theorem states that...");
-      expect(result).toEqual([{ type: "hyde", query: "The CAP theorem states that..." }]);
+      expect(result).toEqual([{ type: "hyde", query: "The CAP theorem states that...", line: 1 }]);
     });
 
     test("uppercase prefix", () => {
-      expect(parseStructuredQuery("LEX: keywords")).toEqual([{ type: "lex", query: "keywords" }]);
-      expect(parseStructuredQuery("VEC: question")).toEqual([{ type: "vec", query: "question" }]);
-      expect(parseStructuredQuery("HYDE: passage")).toEqual([{ type: "hyde", query: "passage" }]);
+      expect(parseStructuredQuery("LEX: keywords")).toEqual([{ type: "lex", query: "keywords", line: 1 }]);
+      expect(parseStructuredQuery("VEC: question")).toEqual([{ type: "vec", query: "question", line: 1 }]);
+      expect(parseStructuredQuery("HYDE: passage")).toEqual([{ type: "hyde", query: "passage", line: 1 }]);
     });
 
     test("mixed case prefix", () => {
-      expect(parseStructuredQuery("Lex: test")).toEqual([{ type: "lex", query: "test" }]);
-      expect(parseStructuredQuery("VeC: test")).toEqual([{ type: "vec", query: "test" }]);
+      expect(parseStructuredQuery("Lex: test")).toEqual([{ type: "lex", query: "test", line: 1 }]);
+      expect(parseStructuredQuery("VeC: test")).toEqual([{ type: "vec", query: "test", line: 1 }]);
     });
   });
 
@@ -115,65 +126,71 @@ describe("parseStructuredQuery", () => {
     test("lex + vec", () => {
       const result = parseStructuredQuery("lex: keywords\nvec: natural language");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "natural language" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "natural language", line: 2 },
       ]);
     });
 
     test("all three types", () => {
       const result = parseStructuredQuery("lex: keywords\nvec: question\nhyde: hypothetical doc");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "question" },
-        { type: "hyde", query: "hypothetical doc" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "question", line: 2 },
+        { type: "hyde", query: "hypothetical doc", line: 3 },
       ]);
     });
 
     test("duplicate types allowed", () => {
       const result = parseStructuredQuery("lex: term1\nlex: term2\nlex: term3");
       expect(result).toEqual([
-        { type: "lex", query: "term1" },
-        { type: "lex", query: "term2" },
-        { type: "lex", query: "term3" },
+        { type: "lex", query: "term1", line: 1 },
+        { type: "lex", query: "term2", line: 2 },
+        { type: "lex", query: "term3", line: 3 },
       ]);
     });
 
     test("order preserved", () => {
       const result = parseStructuredQuery("hyde: passage\nvec: question\nlex: keywords");
       expect(result).toEqual([
-        { type: "hyde", query: "passage" },
-        { type: "vec", query: "question" },
-        { type: "lex", query: "keywords" },
+        { type: "hyde", query: "passage", line: 1 },
+        { type: "vec", query: "question", line: 2 },
+        { type: "lex", query: "keywords", line: 3 },
       ]);
     });
   });
 
   describe("mixed plain and prefixed", () => {
-    test("single plain line with prefixed lines -> plain becomes lex first", () => {
-      const result = parseStructuredQuery("plain keywords\nvec: semantic question");
-      expect(result).toEqual([
-        { type: "lex", query: "plain keywords" },
-        { type: "vec", query: "semantic question" },
-      ]);
+    test("plain line with prefixed lines throws helpful error", () => {
+      expect(() => parseStructuredQuery("plain keywords\nvec: semantic question"))
+        .toThrow(/missing a lex:\/vec:\/hyde:/);
     });
 
-    test("plain line prepended before other prefixed", () => {
-      const result = parseStructuredQuery("keywords\nhyde: passage\nvec: question");
-      expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "hyde", query: "passage" },
-        { type: "vec", query: "question" },
-      ]);
+    test("plain line prepended before other prefixed throws", () => {
+      expect(() => parseStructuredQuery("keywords\nhyde: passage\nvec: question"))
+        .toThrow(/missing a lex:\/vec:\/hyde:/);
     });
   });
 
   describe("error cases", () => {
     test("multiple plain lines throws", () => {
-      expect(() => parseStructuredQuery("line one\nline two")).toThrow("Ambiguous query");
+      expect(() => parseStructuredQuery("line one\nline two")).toThrow(/missing a lex:\/vec:\/hyde:/);
     });
 
     test("three plain lines throws", () => {
-      expect(() => parseStructuredQuery("a\nb\nc")).toThrow("Ambiguous query");
+      expect(() => parseStructuredQuery("a\nb\nc")).toThrow(/missing a lex:\/vec:\/hyde:/);
+    });
+
+    test("mixing expand: with other lines throws", () => {
+      expect(() => parseStructuredQuery("expand: question\nlex: keywords"))
+        .toThrow(/cannot mix expand with typed lines/);
+    });
+
+    test("expand: without text throws", () => {
+      expect(() => parseStructuredQuery("expand:   ")).toThrow(/must include text/);
+    });
+
+    test("typed line without text throws", () => {
+      expect(() => parseStructuredQuery("lex:   \nvec: real")).toThrow(/must include text/);
     });
   });
 
@@ -181,58 +198,56 @@ describe("parseStructuredQuery", () => {
     test("empty lines ignored", () => {
       const result = parseStructuredQuery("lex: keywords\n\nvec: question\n");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "question" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "question", line: 3 },
       ]);
     });
 
     test("whitespace-only lines ignored", () => {
       const result = parseStructuredQuery("lex: keywords\n   \nvec: question");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "question" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "question", line: 3 },
       ]);
     });
 
     test("leading/trailing whitespace trimmed from lines", () => {
       const result = parseStructuredQuery("  lex: keywords  \n  vec: question  ");
       expect(result).toEqual([
-        { type: "lex", query: "keywords" },
-        { type: "vec", query: "question" },
+        { type: "lex", query: "keywords", line: 1 },
+        { type: "vec", query: "question", line: 2 },
       ]);
     });
 
     test("internal whitespace preserved in query", () => {
       const result = parseStructuredQuery("lex:   multiple   spaces   ");
-      expect(result).toEqual([{ type: "lex", query: "multiple   spaces" }]);
+      expect(result).toEqual([{ type: "lex", query: "multiple   spaces", line: 1 }]);
     });
 
-    test("empty prefix value skipped", () => {
-      const result = parseStructuredQuery("lex: \nvec: actual query");
-      expect(result).toEqual([{ type: "vec", query: "actual query" }]);
+    test("empty prefix value throws", () => {
+      expect(() => parseStructuredQuery("lex: \nvec: actual query")).toThrow(/must include text/);
     });
 
-    test("only empty prefix values returns null", () => {
-      const result = parseStructuredQuery("lex: \nvec: \nhyde: ");
-      expect(result).toBeNull();
+    test("only empty prefix values throws", () => {
+      expect(() => parseStructuredQuery("lex: \nvec: \nhyde: ")).toThrow(/must include text/);
     });
   });
 
   describe("edge cases", () => {
     test("colon in query text preserved", () => {
       const result = parseStructuredQuery("lex: time: 12:30 PM");
-      expect(result).toEqual([{ type: "lex", query: "time: 12:30 PM" }]);
+      expect(result).toEqual([{ type: "lex", query: "time: 12:30 PM", line: 1 }]);
     });
 
     test("prefix-like text in query preserved", () => {
       const result = parseStructuredQuery("vec: what does lex: mean");
-      expect(result).toEqual([{ type: "vec", query: "what does lex: mean" }]);
+      expect(result).toEqual([{ type: "vec", query: "what does lex: mean", line: 1 }]);
     });
 
     test("newline in hyde passage (as single line)", () => {
       // If user wants actual newlines in hyde, they need to escape or use multiline syntax
       const result = parseStructuredQuery("hyde: The answer is X. It means Y.");
-      expect(result).toEqual([{ type: "hyde", query: "The answer is X. It means Y." }]);
+      expect(result).toEqual([{ type: "hyde", query: "The answer is X. It means Y.", line: 1 }]);
     });
   });
 });
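Taken together, the `parseStructuredQuery` tests above pin down a fairly small contract. The following is a minimal sketch of a parser that would satisfy them, not the code in `src/qmd.ts`. The name `parseStructuredQuerySketch`, the `SubSearch` type, and the regex are assumptions made for illustration. Only the "missing a lex:/vec:/hyde: prefix" message is copied from the diff. The other error strings are guesses shaped to match the patterns the tests assert on.

```typescript
// Hypothetical types and names; the real definitions live in src/qmd.ts and src/store.ts.
type SubSearchType = "lex" | "vec" | "hyde";

interface SubSearch {
  type: SubSearchType;
  query: string;
  line: number; // 1-based line number in the original input, blank lines included
}

// Case-insensitive prefix, optional whitespace, then the rest of the line.
const PREFIX = /^(lex|vec|hyde|expand):\s*(.*)$/i;

function parseStructuredQuerySketch(input: string): SubSearch[] | null {
  // Number the lines before dropping blanks so reported positions match the input.
  const lines = input
    .split("\n")
    .map((raw, i) => ({ text: raw.trim(), number: i + 1 }))
    .filter((l) => l.text.length > 0);

  if (lines.length === 0) return null; // empty or whitespace-only input

  const typed: SubSearch[] = [];
  for (const line of lines) {
    const m = PREFIX.exec(line.text);
    if (!m) {
      // A lone unprefixed line is an ordinary plain query.
      if (lines.length === 1) return null;
      throw new Error(
        `Line ${line.number} is missing a lex:/vec:/hyde: prefix. ` +
          `Each line in a query document must start with one.`
      );
    }

    const kind = m[1]!.toLowerCase();
    const text = m[2]!.trim();
    if (text.length === 0) {
      // Assumed wording; the tests only require /must include text/.
      throw new Error(`Line ${line.number} (${kind}): must include text.`);
    }

    if (kind === "expand") {
      if (lines.length > 1) {
        // Assumed wording; the tests only require /cannot mix expand with typed lines/.
        throw new Error(`Line ${line.number}: cannot mix expand with typed lines.`);
      }
      return null; // a lone expand: line behaves like a plain query
    }

    typed.push({ type: kind as SubSearchType, query: text, line: line.number });
  }

  return typed.length > 0 ? typed : null;
}

// Usage: blank lines are skipped but still counted, so the second entry reports line 3.
console.log(parseStructuredQuerySketch("lex: keywords\n\nvec: question\n"));
```

Returning `null` for plain and `expand:` input, rather than a one-element array, matches the tests' expectation that the caller falls back to the default expanded search in those cases.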
@@ -318,6 +333,18 @@ describe("structuredSearch", () => {
       expect(r.score).toBeGreaterThanOrEqual(0.5);
     }
   });
+
+  test("throws when lex query contains newline characters", async () => {
+    await expect(structuredSearch(store, [
+      { type: "lex", query: "foo\nbar", line: 3 }
+    ])).rejects.toThrow(/Line 3 \(lex\):/);
+  });
+
+  test("throws when lex query has unmatched quote", async () => {
+    await expect(structuredSearch(store, [
+      { type: "lex", query: "\"unfinished phrase", line: 2 }
+    ])).rejects.toThrow(/unmatched double quote/);
+  });
 });
 
 // =============================================================================
@@ -346,6 +373,20 @@ describe("lex query syntax", () => {
       )).toBeNull();
     });
   });
+
+  describe("validateLexQuery", () => {
+    test("accepts basic lex query", () => {
+      expect(validateLexQuery("auth token")).toBeNull();
+    });
+
+    test("rejects newline", () => {
+      expect(validateLexQuery("foo\nbar")).toContain("single line");
+    });
+
+    test("rejects unmatched quote", () => {
+      expect(validateLexQuery("\"unfinished")).toContain("unmatched");
+    });
+  });
 });
 
 // =============================================================================