5 miesięcy temu · 32112256c1
--- a/README.md
+++ b/README.md
@@ -294,7 +294,7 @@ qmd ls notes/subfolder
 
															 ### Generate Vector Embeddings
														
 
															 ```sh
														
 
															-# Embed all indexed documents (800 tokens/chunk, 15% overlap)
														
 
															+# Embed all indexed documents (900 tokens/chunk, 15% overlap)
														
 
															 qmd embed
														
 
															 # Force re-embed everything
														
@@ -471,7 +471,7 @@ collections     -- Indexed directories with name and glob patterns
 
															 path_contexts   -- Context descriptions by virtual path (qmd://...)
														
 
															 documents       -- Markdown content with metadata and docid (6-char hash)
														
 
															 documents_fts   -- FTS5 full-text index
														
 
															-content_vectors -- Embedding chunks (hash, seq, pos, 800 tokens each)
														
 
															+content_vectors -- Embedding chunks (hash, seq, pos, 900 tokens each)
														
 
															 vectors_vec     -- sqlite-vec vector index (hash_seq key)
														
 
															 llm_cache       -- Cached LLM responses (query expansion, rerank scores)
														
 
															 ```
														
@@ -501,11 +501,11 @@ Collection ──► Glob Pattern ──► Markdown Files ──► Parse Title
 
															 ### Embedding Flow
														
 
															-Documents are chunked into 800-token pieces with 15% overlap:
														
 
															+Documents are chunked into ~900-token pieces with 15% overlap using smart boundary detection:
														
 
															 ```
														
 
															-Document ──► Chunk (800 tokens) ──► Format each chunk ──► node-llama-cpp ──► Store Vectors
														
 
															-                │                    "title | text"        embedBatch()
														
 
															+Document ──► Smart Chunk (~900 tokens) ──► Format each chunk ──► node-llama-cpp ──► Store Vectors
														
 
															+                │                           "title | text"        embedBatch()
														
 
															                 │
														
 
															                 └─► Chunks stored with:
														
 
															                     - hash: document hash
														
@@ -513,6 +513,37 @@ Document ──► Chunk (800 tokens) ──► Format each chunk ──► node
 
															                     - pos: character position in original
														
 
															 ```
														
 
															+### Smart Chunking
														
 
															+
														
 
															+Instead of cutting at hard token boundaries, QMD uses a scoring algorithm to find natural markdown break points. This keeps semantic units (sections, paragraphs, code blocks) together.
														
 
															+
														
 
															+**Break Point Scores:**
														
 
															+
														
 
															+| Pattern | Score | Description |
														
 
															+|---------|-------|-------------|
														
 
															+| `# Heading` | 100 | H1 - major section |
														
 
															+| `## Heading` | 90 | H2 - subsection |
														
 
															+| `### Heading` | 80 | H3 |
														
 
															+| `#### Heading` | 70 | H4 |
														
 
															+| `##### Heading` | 60 | H5 |
														
 
															+| `###### Heading` | 50 | H6 |
														
 
															+| ` ``` ` | 80 | Code block boundary |
														
 
															+| `---` / `***` | 60 | Horizontal rule |
														
 
															+| Blank line | 20 | Paragraph boundary |
														
 
															+| `- item` / `1. item` | 5 | List item |
														
 
															+| Line break | 1 | Minimal break |
														
 
															+
														
 
															+**Algorithm:**
														
 
															+
														
 
															+1. Scan document for all break points with scores
														
 
															+2. When approaching the 900-token target, search a 200-token window before the cutoff
														
 
															+3. Score each break point: `finalScore = baseScore × (1 - (distance/window)² × 0.7)`
														
 
															+4. Cut at the highest-scoring break point
														
 
															+
														
 
															+The squared distance decay means a heading 200 tokens back (score ~30) still beats a simple line break at the target (score 1), but a closer heading wins over a distant one.
														
 
															+
														
 
															+**Code Fence Protection:** Break points inside code blocks are ignored—code stays together. If a code block exceeds the chunk size, it's kept whole when possible.
														
 
															+
														
 
															 ### Query Flow (Hybrid)
														
 
															 ```