
feat: query document format, lex phrase/negation syntax, training data

The 'query document' is now a first-class concept in QMD: a structured
document with typed sub-queries that combine for best recall.

## Query types
- lex:    BM25 keyword search with phrase and negation syntax
- vec:    Semantic vector search (natural language questions)
- hyde:   Hypothetical document (write the expected answer)
- expand: Auto-expand via local LLM (max 1, default for plain queries)
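
A query document can be pictured as an array of typed sub-queries. The sketch below is illustrative only: the names `SubQuery`, `QueryDocument`, and `validateQueryDocument` are hypothetical and not part of QMD. The 1 to 10 sub-query limit and the single `expand` limit follow the schema in `src/mcp.ts`; the empty-text check is an added assumption.

```typescript
// Sketch only: these names are hypothetical, not QMD's API.
type SubQueryType = "lex" | "vec" | "hyde" | "expand";

interface SubQuery {
  type: SubQueryType;
  query: string;
}

type QueryDocument = SubQuery[];

// Returns a list of problems; an empty list means the document looks valid.
function validateQueryDocument(doc: QueryDocument): string[] {
  const problems: string[] = [];
  // The MCP input schema accepts between 1 and 10 sub-queries.
  if (doc.length < 1 || doc.length > 10) {
    problems.push("expected between 1 and 10 sub-queries");
  }
  // "expand" is limited to one per query document.
  const expandCount = doc.filter((s) => s.type === "expand").length;
  if (expandCount > 1) {
    problems.push("at most one expand sub-query is allowed");
  }
  // Assumption: an empty query string is never useful.
  doc.forEach((s, i) => {
    if (s.query.trim().length === 0) {
      problems.push(`sub-query ${i} has empty query text`);
    }
  });
  return problems;
}

const exampleDocument: QueryDocument = [
  { type: "lex", query: '"connection pool" timeout -redis' },
  { type: "vec", query: "why do database connections time out under load" },
];

console.log(validateQueryDocument(exampleDocument));
```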

## Lex syntax
Supported lex operators:
  term               prefix match ("perf" matches "performance")
  "exact phrase"     verbatim match, no prefix
  -term              exclude documents containing term
  -"exact phrase"    exclude documents containing phrase

Examples:
  "C++ performance" optimization -sports -athlete
  "connection pool" timeout -redis
  "machine learning" -sports -athlete

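The lex syntax above can be read as a stream of tokens, each either a bare term or a quoted phrase, and each optionally negated. The following is a simplified sketch of that reading, not the parser shipped in this commit: `LexToken` and `tokenizeLex` are hypothetical names, no sanitizing is done, and an unterminated quote is handled differently from the real `buildFTS5Query`.

```typescript
// Sketch only: a regex-based tokenizer for the lex syntax described above.
interface LexToken {
  text: string;     // token text without surrounding quotes or leading '-'
  phrase: boolean;  // true for "quoted phrases"
  negated: boolean; // true when the token was prefixed with '-'
}

function tokenizeLex(query: string): LexToken[] {
  const tokens: LexToken[] = [];
  // Group 1: optional '-', group 2: quoted phrase body, group 3: bare term.
  const pattern = /(-?)(?:"([^"]*)"|([^\s"]+))/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(query)) !== null) {
    const phrase = match[2] !== undefined;
    const text = (match[2] ?? match[3] ?? "").trim();
    // Skip empty phrases such as "".
    if (text.length > 0) {
      tokens.push({ text, phrase, negated: match[1] === "-" });
    }
  }
  return tokens;
}

const lexTokens = tokenizeLex('"connection pool" timeout -redis');
console.log(lexTokens);
```

Splitting tokens into positive and negative groups first makes the only-negation case easy to detect before any FTS5 query string is assembled.
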
## MCP tool description rewritten
The 'query' tool description now fully teaches AI agents the query
document format, lex syntax, and strategy for combining types.
It includes worked examples, among them intent-aware lex (C++ performance,
not sports), which is critical for disambiguation in dense corpora.
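
The strategy guidance in the tool description (which types to combine for which goal, and the 2× weight on the first sub-query) can be summarized as a lookup table. This is a sketch under assumptions: the `Goal` labels, `planTypes`, and `relativeWeights` are hypothetical names, and how QMD applies the weight during result fusion is not shown in this commit.

```typescript
// Sketch only: names below are illustrative, not QMD's API.
type SubQueryType = "lex" | "vec" | "hyde" | "expand";

type Goal =
  | "exact-term"          // know the exact term or name
  | "concept"             // concept search
  | "best-recall"
  | "nuanced"             // complex or nuanced topic
  | "unknown-vocabulary";

// Mirrors the Goal/Approach table in the tool description.
const STRATEGY: Record<Goal, SubQueryType[]> = {
  "exact-term": ["lex"],
  "concept": ["vec"],
  "best-recall": ["lex", "vec"],
  "nuanced": ["lex", "vec", "hyde"],
  "unknown-vocabulary": ["expand"],
};

function planTypes(goal: Goal): SubQueryType[] {
  return STRATEGY[goal].slice();
}

// The first sub-query gets twice the weight of the others,
// so the strongest signal should be listed first.
function relativeWeights(count: number): number[] {
  const weights: number[] = [];
  for (let i = 0; i < count; i++) {
    weights.push(i === 0 ? 2 : 1);
  }
  return weights;
}

const plannedTypes = planTypes("nuanced");
console.log(plannedTypes, relativeWeights(plannedTypes.length));
```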

## Unit tests
12 new lex parser tests covering:
- plain terms, quoted phrases, negation, combined
- intent-aware disambiguation (performance -sports -athlete)
- only-negation returns null (FTS5 constraint)
- empty/whitespace handling
- special characters stripped from terms

## Training data
12 new intent-aware examples for next model training round:
- Real technical topics with lex phrase+negation combinations
- Covers: C++ perf, Python memory, DB connections, rate limiting,
  SQL optimization, ML overfitting, Docker, JWT, async/await,
  git conflicts, Kubernetes, React state
- Each shows how context/intent shapes lex query construction
  (e.g. performance with C++ context → -sports -athlete exclusions)
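
Each line of the new JSONL file holds a `query`, an `intent`, and an `output` list of `[type, text]` pairs. A minimal sketch of reading one such line follows. `TrainingExample` and `parseTrainingLine` are hypothetical names, the sample line is abbreviated from the Docker example in the file, and accepting only `lex`, `vec`, and `hyde` is an assumption based on the types that appear in this file.

```typescript
// Sketch only: a shape check for one line of the training JSONL file.
type OutputType = "lex" | "vec" | "hyde";
const OUTPUT_TYPES: OutputType[] = ["lex", "vec", "hyde"];

interface TrainingExample {
  query: string;
  intent: string;
  output: [OutputType, string][];
}

function parseTrainingLine(line: string): TrainingExample {
  const raw: unknown = JSON.parse(line);
  if (typeof raw !== "object" || raw === null) {
    throw new Error("expected a JSON object");
  }
  const rec = raw as Record<string, unknown>;
  if (typeof rec.query !== "string" || typeof rec.intent !== "string") {
    throw new Error("query and intent must be strings");
  }
  if (!Array.isArray(rec.output)) {
    throw new Error("output must be an array of [type, text] pairs");
  }
  const output = rec.output.map((pair: unknown): [OutputType, string] => {
    if (!Array.isArray(pair) || pair.length !== 2) {
      throw new Error("each output entry must be a [type, text] pair");
    }
    const type: unknown = pair[0];
    const text: unknown = pair[1];
    if (typeof type !== "string" || typeof text !== "string") {
      throw new Error("type and text must be strings");
    }
    if (OUTPUT_TYPES.indexOf(type as OutputType) === -1) {
      throw new Error(`unknown output type: ${type}`);
    }
    return [type as OutputType, text];
  });
  return { query: rec.query, intent: rec.intent, output };
}

// Abbreviated sample: the second lex entry and the hyde entry are omitted.
const sampleLine =
  '{"query": "docker container networking", "intent": "configuring Docker networking", ' +
  '"output": [["lex", "\\"docker network\\" bridge overlay -kubernetes -k8s"], ' +
  '["vec", "how do Docker containers communicate with each other on the same host"]]}';

const parsedExample = parseTrainingLine(sampleLine);
console.log(parsedExample.output.map((pair) => pair[0]));
```
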
Tobi Lütke, 3 months ago
parent
commit
3b87e3e224

+ 12 - 0
finetune/data/qmd_expansion_lex_phrases_negation.jsonl

@@ -0,0 +1,12 @@
+{"query": "performance optimization C++ program", "intent": "optimizing a C++ application", "output": [["lex", "\"C++ performance\" optimization -sports -athlete"], ["lex", "\"memory allocation\" \"cache miss\" profiling"], ["vec", "how to improve the runtime performance of a C++ program"], ["hyde", "C++ performance optimization involves profiling with tools like perf or valgrind, reducing memory allocations, improving cache locality, and avoiding unnecessary copies. Key techniques include move semantics, reserve on vectors, and SIMD intrinsics for hot loops."]]}
+{"query": "python memory leak debugging", "intent": "debugging a Python application", "output": [["lex", "\"memory leak\" python -java -javascript"], ["lex", "\"garbage collector\" reference counting tracemalloc"], ["vec", "how to find and fix memory leaks in a Python application"], ["hyde", "Python memory leaks often occur when objects hold references longer than expected, preventing garbage collection. Use tracemalloc or memory_profiler to track allocations over time. Common causes include circular references, global caches, and event listeners that aren't removed."]]}
+{"query": "connection pool exhaustion database", "intent": "debugging database connectivity issues", "output": [["lex", "\"connection pool\" exhausted timeout -redis -cache"], ["lex", "\"max connections\" database pool overflow"], ["vec", "why does the database connection pool run out of connections under load"], ["hyde", "Connection pool exhaustion occurs when all connections are checked out and new requests must wait or fail. This typically happens when queries run slowly, connections aren't released properly, or traffic spikes beyond pool capacity. Fix by increasing pool size, setting connection timeouts, or optimizing slow queries."]]}
+{"query": "rate limiting API implementation", "intent": "implementing rate limiting in a backend service", "output": [["lex", "\"rate limit\" \"token bucket\" OR \"sliding window\" -ethereum -cryptocurrency"], ["lex", "\"requests per second\" throttle middleware"], ["vec", "how to implement rate limiting for an API endpoint"], ["hyde", "A token bucket rate limiter grants each client N tokens per time window. Each request consumes one token. When tokens run out, requests are rejected with 429 Too Many Requests. Tokens refill at a fixed rate. This allows short bursts while enforcing long-term average rate limits."]]}
+{"query": "SQL query performance slow", "intent": "optimizing a slow database query", "output": [["lex", "\"query plan\" index scan \"full table\" slow -nosql -mongodb"], ["lex", "\"missing index\" \"N+1\" joins optimization"], ["vec", "why is my SQL query running slowly and how do I optimize it"], ["hyde", "Slow SQL queries are usually caused by missing indexes, full table scans, or N+1 query patterns. Use EXPLAIN ANALYZE to view the query plan. Add indexes on columns used in WHERE, JOIN, and ORDER BY clauses. Avoid SELECT * and fetch only the columns needed."]]}
+{"query": "machine learning model overfitting", "intent": "training a machine learning model", "output": [["lex", "\"overfitting\" \"validation loss\" regularization -sports -fitness"], ["lex", "dropout \"weight decay\" \"early stopping\" training"], ["vec", "how to detect and prevent overfitting in a machine learning model"], ["hyde", "Overfitting occurs when a model memorizes training data instead of learning generalizable patterns, leading to high training accuracy but poor validation performance. Prevention techniques include dropout, L1/L2 regularization, data augmentation, and early stopping based on validation loss."]]}
+{"query": "docker container networking", "intent": "configuring Docker networking", "output": [["lex", "\"docker network\" bridge overlay -kubernetes -k8s"], ["lex", "\"container ip\" \"port mapping\" expose publish"], ["vec", "how do Docker containers communicate with each other on the same host"], ["hyde", "Docker containers on the same host communicate via bridge networks. By default, containers on the bridge network can reach each other by IP. Use docker network create to make named networks, then --network flag to attach containers. Services can reference each other by container name as DNS hostname."]]}
+{"query": "authentication token JWT", "intent": "implementing JWT authentication", "output": [["lex", "\"JWT\" \"json web token\" \"secret key\" -oauth -saml"], ["lex", "\"bearer token\" signature expiry claims"], ["vec", "how does JWT authentication work and how do I validate a token"], ["hyde", "JWT tokens consist of three base64-encoded parts: header (algorithm), payload (claims like user id and expiry), and signature. The server validates the signature using a secret key and checks that the token hasn't expired. Never store sensitive data in the payload as it's only encoded, not encrypted."]]}
+{"query": "async await error handling javascript", "intent": "writing async JavaScript code", "output": [["lex", "\"async\" \"await\" \"try catch\" promise -python -rust"], ["lex", "\"unhandled rejection\" \"error boundary\" async"], ["vec", "how to properly handle errors in async/await JavaScript functions"], ["hyde", "Wrap await calls in try/catch blocks to handle rejections. Unhandled promise rejections crash Node.js processes. For parallel operations, use Promise.allSettled() instead of Promise.all() to handle partial failures. Always chain .catch() or use try/catch — never leave promises floating."]]}
+{"query": "git merge conflict resolution", "intent": "resolving a git merge conflict", "output": [["lex", "\"merge conflict\" \"<<<<<<\" rebase resolution -github -gitlab"], ["lex", "\"conflict markers\" \"ours\" \"theirs\" checkout"], ["vec", "how do I resolve a git merge conflict between two branches"], ["hyde", "Git merge conflicts occur when two branches change the same lines. Conflict markers show both versions: <<<<<<< HEAD (your changes), ======= (separator), and >>>>>>> branch-name (incoming). Edit the file to keep the correct version, remove the markers, then git add and git commit to complete the merge."]]}
+{"query": "kubernetes pod crashloopbackoff", "intent": "debugging a Kubernetes deployment", "output": [["lex", "\"CrashLoopBackOff\" pod logs restart -docker -vagrant"], ["lex", "\"container failed\" liveness probe startup"], ["vec", "why is my Kubernetes pod stuck in CrashLoopBackOff and how do I fix it"], ["hyde", "CrashLoopBackOff means the container keeps crashing and Kubernetes is backing off restarts exponentially. Check logs with kubectl logs --previous to see the last crash output. Common causes: misconfigured environment variables, missing secrets, failed liveness probes, or an application crash on startup."]]}
+{"query": "react state management redux", "intent": "managing state in a React application", "output": [["lex", "\"Redux\" \"useReducer\" \"action creator\" -angular -vue"], ["lex", "\"store dispatch\" \"selector\" \"middleware\" thunk"], ["vec", "when should I use Redux versus local React state for state management"], ["hyde", "Redux is best for state shared across many components or that needs to persist across routes. Local useState is fine for UI state scoped to one component. For medium complexity, useContext + useReducer avoids Redux boilerplate. Redux shines for large apps with complex update logic, time-travel debugging, or middleware needs."]]}

+ 66 - 16
src/mcp.ts

@@ -230,29 +230,79 @@ function createMcpServer(store: Store): McpServer {
 
 
   const subSearchSchema = z.object({
     type: z.enum(['lex', 'vec', 'hyde', 'expand']).describe(
-      "Query type: 'lex' = BM25 keywords, 'vec' = semantic question, " +
-      "'hyde' = hypothetical answer, 'expand' = auto-expand via LLM (max 1)"
+      "lex = BM25 keywords (supports \"phrase\" and -negation), " +
+      "vec = semantic question, hyde = hypothetical answer passage, " +
+      "expand = auto-expand via LLM (max 1 per query)"
+    ),
+    query: z.string().describe(
+      "The query text. For lex: use keywords, \"quoted phrases\", and -negation. " +
+      "For vec: natural language question. For hyde: 50-100 word answer passage."
     ),
-    query: z.string().describe("The query text"),
   });
 
   server.registerTool(
     "query",
     {
       title: "Query",
-      description: `Search the knowledge base with typed sub-queries.
-
-**Query types:**
-- \`lex\`: BM25 keyword search. Supports "exact phrase" and -negation.
-- \`vec\`: Semantic vector search. Natural language questions.
-- \`hyde\`: Hypothetical document. Write what the answer looks like (50-100 words).
-- \`expand\`: Auto-expand via local LLM. Max one per query.
-
-**Examples:**
-- Quick lookup: [{ type: "lex", query: "CAP theorem" }]
-- Semantic: [{ type: "vec", query: "consistency vs availability tradeoff" }]
-- Best results: [{ type: "lex", query: "CAP" }, { type: "vec", query: "distributed systems consistency" }]
-- Auto-expand: [{ type: "expand", query: "how does rate limiting work" }]`,
+      description: `Search the knowledge base using a query document — one or more typed sub-queries combined for best recall.
+
+## Query Types
+
+**lex** — BM25 keyword search. Fast, exact, no LLM needed.
+Full lex syntax:
+- \`term\` — prefix match ("perf" matches "performance")
+- \`"exact phrase"\` — phrase must appear verbatim
+- \`-term\` or \`-"phrase"\` — exclude documents containing this
+
+Good lex examples:
+- \`"connection pool" timeout -redis\`
+- \`"machine learning" -sports -athlete\`
+- \`handleError async typescript\`
+
+**vec** — Semantic vector search. Write a natural language question. Finds documents by meaning, not exact words.
+- \`how does the rate limiter handle burst traffic?\`
+- \`what is the tradeoff between consistency and availability?\`
+
+**hyde** — Hypothetical document. Write 50-100 words that look like the answer. Often the most powerful for nuanced topics.
+- \`The rate limiter uses a token bucket algorithm. When a client exceeds 100 req/min, subsequent requests return 429 until the window resets.\`
+
+**expand** — Auto-expand via local LLM. Generates lex+vec+hyde variations automatically. Max one per query. Useful when you don't know the exact terms.
+
+## Strategy
+
+Combine types for best results. First sub-query gets 2× weight — put your strongest signal first.
+
+| Goal | Approach |
+|------|----------|
+| Know exact term/name | \`lex\` only |
+| Concept search | \`vec\` only |
+| Best recall | \`lex\` + \`vec\` |
+| Complex/nuanced | \`lex\` + \`vec\` + \`hyde\` |
+| Unknown vocabulary | \`expand\` |
+
+## Examples
+
+Simple lookup:
+\`\`\`json
+[{ "type": "lex", "query": "CAP theorem" }]
+\`\`\`
+
+Best recall on a technical topic:
+\`\`\`json
+[
+  { "type": "lex", "query": "\\"connection pool\\" timeout -redis" },
+  { "type": "vec", "query": "why do database connections time out under load" },
+  { "type": "hyde", "query": "Connection pool exhaustion occurs when all connections are in use and new requests must wait. This typically happens under high concurrency when queries run longer than expected." }
+]
+\`\`\`
+
+Intent-aware lex (C++ performance, not sports):
+\`\`\`json
+[
+  { "type": "lex", "query": "\\"C++ performance\\" optimization -sports -athlete" },
+  { "type": "vec", "query": "how to optimize C++ program performance" }
+]
+\`\`\``,
       annotations: { readOnlyHint: true, openWorldHint: false },
       inputSchema: {
         searches: z.array(subSearchSchema).min(1).max(10).describe(

+ 101 - 0
test/structured-search.test.ts

@@ -348,3 +348,104 @@ describe("lex query syntax", () => {
     });
   });
 });
+
+// =============================================================================
+// buildFTS5Query Tests (lex parser)
+// =============================================================================
+
+describe("buildFTS5Query (lex parser)", () => {
+  // Mirror the function for unit testing
+  function sanitizeFTS5Term(term: string): string {
+    return term.replace(/[^\p{L}\p{N}']/gu, '').toLowerCase();
+  }
+
+  function buildFTS5Query(query: string): string | null {
+    const positive: string[] = [];
+    const negative: string[] = [];
+    let i = 0;
+    const s = query.trim();
+
+    while (i < s.length) {
+      while (i < s.length && /\s/.test(s[i]!)) i++;
+      if (i >= s.length) break;
+      const negated = s[i] === '-';
+      if (negated) i++;
+
+      if (s[i] === '"') {
+        const start = i + 1; i++;
+        while (i < s.length && s[i] !== '"') i++;
+        const phrase = s.slice(start, i).trim();
+        i++;
+        if (phrase.length > 0) {
+          const sanitized = phrase.split(/\s+/).map((t: string) => sanitizeFTS5Term(t)).filter((t: string) => t).join(' ');
+          if (sanitized) (negated ? negative : positive).push(`"${sanitized}"`);
+        }
+      } else {
+        const start = i;
+        while (i < s.length && !/[\s"]/.test(s[i]!)) i++;
+        const term = s.slice(start, i);
+        const sanitized = sanitizeFTS5Term(term);
+        if (sanitized) (negated ? negative : positive).push(`"${sanitized}"*`);
+      }
+    }
+
+    // FTS5 cannot run a NOT-only query, so at least one positive term is required.
+    if (positive.length === 0) return null;
+
+    let result = positive.join(' AND ');
+    for (const neg of negative) result = `${result} NOT ${neg}`;
+    return result;
+  }
+
+  test("plain terms → prefix match with AND", () => {
+    expect(buildFTS5Query("foo bar")).toBe('"foo"* AND "bar"*');
+  });
+
+  test("single term", () => {
+    expect(buildFTS5Query("performance")).toBe('"performance"*');
+  });
+
+  test("quoted phrase → exact match (no prefix)", () => {
+    expect(buildFTS5Query('"machine learning"')).toBe('"machine learning"');
+  });
+
+  test("quoted phrase: punctuation stripped and lowercased", () => {
+    expect(buildFTS5Query('"C++ performance"')).toBe('"c performance"');
+  });
+
+  test("negation of term", () => {
+    expect(buildFTS5Query("performance -sports")).toBe('"performance"* NOT "sports"*');
+  });
+
+  test("negation of phrase", () => {
+    expect(buildFTS5Query('performance -"sports athlete"')).toBe('"performance"* NOT "sports athlete"');
+  });
+
+  test("multiple negations", () => {
+    expect(buildFTS5Query("performance -sports -athlete")).toBe('"performance"* NOT "sports"* NOT "athlete"*');
+  });
+
+  test("quoted positive + negation", () => {
+    expect(buildFTS5Query('"machine learning" -sports -athlete')).toBe('"machine learning" NOT "sports"* NOT "athlete"*');
+  });
+
+  test("intent-aware C++ performance example", () => {
+    const result = buildFTS5Query('"C++ performance" optimization -sports -athlete');
+    expect(result).toContain('NOT "sports"*');
+    expect(result).toContain('NOT "athlete"*');
+    expect(result).toContain('"optimization"*');
+  });
+
+  test("only negations with no positives → null (can't search)", () => {
+    expect(buildFTS5Query("-sports -athlete")).toBeNull();
+  });
+
+  test("empty string → null", () => {
+    expect(buildFTS5Query("")).toBeNull();
+    expect(buildFTS5Query("   ")).toBeNull();
+  });
+
+  test("special chars in terms stripped", () => {
+    expect(buildFTS5Query("hello!world")).toBe('"helloworld"*');
+  });
+});