|
@@ -294,7 +294,7 @@ qmd ls notes/subfolder
|
|
|
### Generate Vector Embeddings
|
|
### Generate Vector Embeddings
|
|
|
|
|
|
|
|
```sh
|
|
```sh
|
|
|
-# Embed all indexed documents (800 tokens/chunk, 15% overlap)
|
|
|
|
|
|
|
+# Embed all indexed documents (900 tokens/chunk, 15% overlap)
|
|
|
qmd embed
|
|
qmd embed
|
|
|
|
|
|
|
|
# Force re-embed everything
|
|
# Force re-embed everything
|
|
@@ -471,7 +471,7 @@ collections -- Indexed directories with name and glob patterns
|
|
|
path_contexts -- Context descriptions by virtual path (qmd://...)
|
|
path_contexts -- Context descriptions by virtual path (qmd://...)
|
|
|
documents -- Markdown content with metadata and docid (6-char hash)
|
|
documents -- Markdown content with metadata and docid (6-char hash)
|
|
|
documents_fts -- FTS5 full-text index
|
|
documents_fts -- FTS5 full-text index
|
|
|
-content_vectors -- Embedding chunks (hash, seq, pos, 800 tokens each)
|
|
|
|
|
|
|
+content_vectors -- Embedding chunks (hash, seq, pos, 900 tokens each)
|
|
|
vectors_vec -- sqlite-vec vector index (hash_seq key)
|
|
vectors_vec -- sqlite-vec vector index (hash_seq key)
|
|
|
llm_cache -- Cached LLM responses (query expansion, rerank scores)
|
|
llm_cache -- Cached LLM responses (query expansion, rerank scores)
|
|
|
```
|
|
```
|
|
@@ -501,11 +501,11 @@ Collection ──► Glob Pattern ──► Markdown Files ──► Parse Title
|
|
|
|
|
|
|
|
### Embedding Flow
|
|
### Embedding Flow
|
|
|
|
|
|
|
|
-Documents are chunked into 800-token pieces with 15% overlap:
|
|
|
|
|
|
|
+Documents are chunked into ~900-token pieces with 15% overlap using smart boundary detection:
|
|
|
|
|
|
|
|
```
|
|
```
|
|
|
-Document ──► Chunk (800 tokens) ──► Format each chunk ──► node-llama-cpp ──► Store Vectors
|
|
|
|
|
- │ "title | text" embedBatch()
|
|
|
|
|
|
|
+Document ──► Smart Chunk (~900 tokens) ──► Format each chunk ──► node-llama-cpp ──► Store Vectors
|
|
|
|
|
+ │ "title | text" embedBatch()
|
|
|
│
|
|
│
|
|
|
└─► Chunks stored with:
|
|
└─► Chunks stored with:
|
|
|
- hash: document hash
|
|
- hash: document hash
|
|
@@ -513,6 +513,37 @@ Document ──► Chunk (800 tokens) ──► Format each chunk ──► node
|
|
|
- pos: character position in original
|
|
- pos: character position in original
|
|
|
```
|
|
```
|
|
|
|
|
|
|
|
|
|
+### Smart Chunking
|
|
|
|
|
+
|
|
|
|
|
+Instead of cutting at hard token boundaries, QMD uses a scoring algorithm to find natural markdown break points. This keeps semantic units (sections, paragraphs, code blocks) together.
|
|
|
|
|
+
|
|
|
|
|
+**Break Point Scores:**
|
|
|
|
|
+
|
|
|
|
|
+| Pattern | Score | Description |
|
|
|
|
|
+|---------|-------|-------------|
|
|
|
|
|
+| `# Heading` | 100 | H1 - major section |
|
|
|
|
|
+| `## Heading` | 90 | H2 - subsection |
|
|
|
|
|
+| `### Heading` | 80 | H3 |
|
|
|
|
|
+| `#### Heading` | 70 | H4 |
|
|
|
|
|
+| `##### Heading` | 60 | H5 |
|
|
|
|
|
+| `###### Heading` | 50 | H6 |
|
|
|
|
|
+| ` ``` ` | 80 | Code block boundary |
|
|
|
|
|
+| `---` / `***` | 60 | Horizontal rule |
|
|
|
|
|
+| Blank line | 20 | Paragraph boundary |
|
|
|
|
|
+| `- item` / `1. item` | 5 | List item |
|
|
|
|
|
+| Line break | 1 | Minimal break |
|
|
|
|
|
+
|
|
|
|
|
+**Algorithm:**
|
|
|
|
|
+
|
|
|
|
|
+1. Scan document for all break points with scores
|
|
|
|
|
+2. When approaching the 900-token target, search a 200-token window before the cutoff
|
|
|
|
|
+3. Score each break point: `finalScore = baseScore × (1 - (distance/window)² × 0.7)`
|
|
|
|
|
+4. Cut at the highest-scoring break point
|
|
|
|
|
+
|
|
|
|
|
+The squared distance decay means a heading 200 tokens back (score ~30) still beats a simple line break at the target (score 1), but a closer heading wins over a distant one.
|
|
|
|
|
+
|
|
|
|
|
+**Code Fence Protection:** Break points inside code blocks are ignored—code stays together. If a code block exceeds the chunk size, it's kept whole when possible.
|
|
|
|
|
+
|
|
|
### Query Flow (Hybrid)
|
|
### Query Flow (Hybrid)
|
|
|
|
|
|
|
|
```
|
|
```
|