GitHub - tobi/qmd: mini cli search engine for your docs, knowledge bases, meeting notes, whatever. Tracking current sota approaches while being all local

tobi
2026.03.06
· GitHub · by μ΅œμ„Έμ˜
#CLI #LLM #Local AI #RAG #Search Engine

Key Points

  1. QMD is an on-device search engine that indexes markdown notes and documents, providing highly relevant search results through a hybrid approach combining BM25, vector semantic search, and LLM re-ranking.
  2. It leverages local GGUF models via `node-llama-cpp` for embedding, query expansion, and re-ranking within a sophisticated pipeline that includes Reciprocal Rank Fusion, top-rank bonuses, and position-aware blending.
  3. Designed for agentic workflows, QMD offers versatile command-line options for searching, document retrieval, and collection management, with an MCP server available for tighter integration with AI agents and applications.

QMD (Query Markup Documents) is an on-device, hybrid search engine designed for personal knowledge bases, including markdown notes, meeting transcripts, and documentation. It integrates BM25 full-text search, vector semantic search, and LLM-powered re-ranking, all executed locally using node-llama-cpp with GGUF models. QMD aims to provide enhanced contextual search results suitable for agentic workflows.

The core methodology of QMD revolves around a sophisticated indexing and query pipeline:

I. Indexing Flow:

  1. Collection Management: Users define collections, which are directories of documents, optionally with glob masks.
  2. Document Processing: Markdown files within collections are parsed. A unique 6-character hash (docid) is generated for each document, and the first heading or filename is extracted as a title.
  3. FTS5 Indexing: The full text content of documents is stored and indexed in a SQLite FTS5 table (documents_fts) for fast keyword-based BM25 search.
  4. Smart Chunking: Documents are intelligently segmented into chunks of approximately 900 tokens with a 15% overlap. This process prioritizes semantic integrity by using a scoring algorithm to identify natural markdown break points.
    • Break Point Scoring: Various markdown elements (e.g., headings, code blocks, horizontal rules, blank lines, list items) are assigned base scores. Headings have higher scores (H1: 100, H6: 50), while blank lines have lower scores (20). Code fence boundaries receive a high score (80).
    • Algorithm: When nearing the 900-token target, a 200-token window before the cutoff is scanned. Each potential break point within this window is assigned a finalScore using a squared distance decay:
\[
\text{finalScore} = \text{baseScore} \times (1 - (\frac{\text{distance}}{\text{window}})^2 \times 0.7)
\]
The chunk is cut at the highest-scoring break point. This ensures that semantic units remain intact. Code blocks are protected, and break points inside them are ignored.
  5. Embedding Generation: Each formatted chunk (e.g., "title | text") is fed to the embeddinggemma-300M-Q8_0 GGUF model via node-llama-cpp's embedBatch() function to generate dense vector embeddings.
  6. Vector Storage: These embeddings are stored in a sqlite-vec index (vectors_vec), associated with the chunk's document hash and sequence number, enabling efficient semantic similarity search.
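The break-point scoring in step 4 can be illustrated with a minimal TypeScript sketch. The window size, decay factor, and example base scores come from the text; the names (`BreakPoint`, `finalScore`, `bestBreakPoint`) are illustrative, not QMD's actual identifiers.

```typescript
interface BreakPoint {
  offset: number;     // token distance back from the ~900-token cutoff
  baseScore: number;  // e.g. H1 heading: 100, code fence: 80, blank line: 20
}

const WINDOW = 200; // tokens scanned before the cutoff

// finalScore = baseScore * (1 - (distance / window)^2 * 0.7)
function finalScore(bp: BreakPoint): number {
  const decay = (bp.offset / WINDOW) ** 2 * 0.7;
  return bp.baseScore * (1 - decay);
}

// Pick the highest-scoring candidate; the caller is assumed to have already
// filtered out break points that fall inside protected code blocks.
function bestBreakPoint(candidates: BreakPoint[]): BreakPoint {
  return candidates.reduce((a, b) => (finalScore(b) > finalScore(a) ? b : a));
}

// A blank line right at the cutoff keeps its full score of 20, while an H1
// heading 180 tokens back decays to 100 * (1 - 0.81 * 0.7) ≈ 43.3 — so the
// distant heading still wins, keeping the section intact.
const pick = bestBreakPoint([
  { offset: 0, baseScore: 20 },    // blank line at the cutoff
  { offset: 180, baseScore: 100 }, // H1 heading 180 tokens earlier
]);
```

The squared decay is gentle near the cutoff and steep at the window edge, so high-value structural breaks (headings, code fences) outcompete nearby low-value ones (blank lines) unless they are very far back.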

II. Query Flow (Hybrid Search Pipeline):
The qmd query command orchestrates a complex hybrid search pipeline:

  1. LLM Query Expansion: The user's original query undergoes an initial expansion phase using the qmd-query-expansion-1.7B-q4_k_m GGUF model. This generates two alternative queries based on the original query, aiming to capture broader intent.
  2. Parallel Retrieval: For the original query (weighted ×2) and each of the two expanded queries, parallel searches are performed:
    • BM25 Full-Text Search: Utilizes the FTS5 index for keyword-based retrieval. FTS5's raw BM25 scores are negative (lower is better), so they are normalized via Math.abs(score).
    • Vector Semantic Search: Uses the sqlite-vec index to find semantically similar documents. Raw distances are normalized as 1 / (1 + distance), so smaller cosine distances map to higher similarity scores (a distance of 0.0 yields a similarity of 1.0).
  3. Reciprocal Rank Fusion (RRF): The results from all six parallel searches (3 queries × 2 search types) are combined using Reciprocal Rank Fusion with a constant k = 60. The RRF score for a document is calculated as:
\[
\text{Score}_{\text{RRF}} = \sum_{r \in \text{Ranks}} \frac{1}{k + \text{rank}_r}
\]
where \(\text{rank}_r\) is the rank of the document in a specific search list.
  • Weighting: The original query's results are given a ×2 weight during fusion.
  • Top-Rank Bonus: Documents ranking #1 in any individual retrieval list receive an additional +0.05 bonus, and those ranking #2-3 receive +0.02, helping to preserve high-precision matches.
  4. Top-K Selection: The top 30 candidates after RRF fusion are selected for the re-ranking stage.
  5. LLM Re-ranking: The selected candidates are passed to the qwen3-reranker-0.6b-q8_0 GGUF model. This cross-encoder reranker scores each document for relevance (0.0 to 1.0) based on the query and document content, returning results sorted by this relevance score. The reranker provides "yes/no" confidence with log-probabilities.
  6. Position-Aware Blending: The final score for each document is a blend of its RRF score (retrieval) and its LLM re-ranking score (reranker). The blending ratio is dynamically adjusted based on the document's RRF rank:
    • RRF ranks 1-3: 75% RRF score, 25% reranker score. This prioritizes the strong initial retrieval signal for top results.
    • RRF ranks 4-10: 60% RRF score, 40% reranker score.
    • RRF ranks 11+: 40% RRF score, 60% reranker score. This trusts the LLM reranker more for lower-ranked documents where initial retrieval might be less precise.
    • Reranker raw scores (a 0-10 rating) are normalized to 0.0-1.0 by dividing by 10 before blending.
  7. Final Results: Documents are presented with their path, docid, title, configured context, blended score (0.0-1.0), and a highlighted snippet.
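The fusion and blending arithmetic above can be sketched in a few lines of TypeScript. The constants follow the text (k = 60, ×2 weight for the original query's lists, +0.05/+0.02 top-rank bonuses, and the three blend bands); the function names, and the assumption that the RRF score is normalized to 0..1 before blending, are illustrative rather than QMD's actual implementation.

```typescript
interface RankedList {
  weight: number;   // 2 for the original query's lists, 1 for expansions
  docids: string[]; // docids in rank order (index 0 = rank 1)
}

const K = 60;

// Weighted Reciprocal Rank Fusion with top-rank bonuses: each list
// contributes weight / (k + rank), plus +0.05 for rank 1 and +0.02 for
// ranks 2-3, summed per document across all six lists.
function rrfFuse(lists: RankedList[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.docids.forEach((id, i) => {
      const rank = i + 1;
      let s = list.weight / (K + rank);
      if (rank === 1) s += 0.05;       // #1 in any individual list
      else if (rank <= 3) s += 0.02;   // #2-3 in any individual list
      scores.set(id, (scores.get(id) ?? 0) + s);
    });
  }
  return scores;
}

// Position-aware blend. Both inputs are assumed to be in 0..1 here
// (the reranker's 0-10 rating divided by 10, and an RRF score scaled
// to 0..1), since the final blended score is reported as 0.0-1.0.
function blend(rrfRank: number, rrfScore: number, rerankScore: number): number {
  const w = rrfRank <= 3 ? 0.75 : rrfRank <= 10 ? 0.6 : 0.4;
  return w * rrfScore + (1 - w) * rerankScore;
}

// Example: "a" is rank 1 in the original-query list (weight 2) and rank 2
// in an expansion list; the ×2 weighting lets it edge out "b" even though
// their rank positions are symmetric.
const fused = rrfFuse([
  { weight: 2, docids: ["a", "b"] }, // original query (×2)
  { weight: 1, docids: ["b", "a"] }, // expanded query
]);
```

Note how the blend bands shift trust: a document the retrievers already rank in the top 3 keeps 75% of its retrieval signal, while a rank-11+ document lets the cross-encoder dominate at 60%.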

QMD supports various commands for collection management, context addition (providing descriptive metadata to paths), embedding generation, and searching (BM25-only, vector-only, or hybrid). It offers flexible output formats (JSON, CSV, Markdown, XML) for agent integration and can expose an MCP (Model Context Protocol) server for tighter programmatic access. Models are auto-downloaded from HuggingFace and cached locally.