GitHub - mixedbread-ai/mgrep: A calm, CLI-native way to semantically grep everything, like code, images, pdfs and more.
Key Points
- mgrep is a CLI tool that provides semantic, natural-language search across local files like code, images, and PDFs, with optional web integration.
- It modernizes the "grep" experience by enabling intent-based queries, reducing AI agent token usage by focusing on relevant snippets, and enhancing code exploration.
- The tool uses background indexing to sync files to cloud-backed Mixedbread Stores, offering state-of-the-art semantic retrieval and reranking for contextual search results.
mgrep is a command-line interface (CLI) tool designed to provide semantic search capabilities across various data types, serving as a modern complement to traditional grep. It addresses the limitations of lexical pattern matching by enabling natural-language queries for codebases, images, PDFs, and soon, audio and video.
The core methodology of mgrep is powered by Mixedbread Search, a full-featured search solution. This system leverages state-of-the-art semantic retrieval models, which transform both search queries and document content into high-dimensional vector embeddings. These embeddings capture the semantic meaning of the data, allowing for similarity-based retrieval rather than exact string matching. The process involves context-aware parsing, which intelligently processes different file types (e.g., understanding code structure, PDF layouts) to generate meaningful units for embedding. Optimized inference methods ensure efficient and rapid search results from these vector representations.
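To make the idea of similarity-based retrieval concrete, here is a minimal sketch of ranking documents by cosine similarity between embedding vectors. It is not Mixedbread's implementation: the three-dimensional vectors, the file paths, and the function names are all hypothetical, and hand-written numbers stand in for what a real embedding model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: a real model would emit hundreds of dimensions.
DOC_EMBEDDINGS = {
    "auth/login.py": [0.9, 0.1, 0.0],
    "billing/invoice.py": [0.1, 0.9, 0.1],
    "docs/setup.pdf": [0.0, 0.2, 0.9],
}


def rank_by_similarity(query_embedding, doc_embeddings):
    """Return (path, score) pairs, most similar first."""
    scored = [
        (path, cosine_similarity(query_embedding, vector))
        for path, vector in doc_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Assumed embedding of a query such as "where do users sign in?"
query = [0.8, 0.2, 0.1]
ranking = rank_by_similarity(query, DOC_EMBEDDINGS)
```

The query shares no literal string with any file, yet the closest vector still surfaces first, which is the property that separates this approach from lexical matching.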
For indexing, mgrep employs a background syncing mechanism via the mgrep watch command. This command performs an initial synchronization of a project's files to a cloud-backed Mixedbread Store, respecting .gitignore and .mgrepignore rules. It then continuously monitors for file changes, updates, and deletions, ensuring the semantic index remains fresh and accurate. This proactive indexing allows agents and teammates to query the same up-to-date corpus without redundant uploads.
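The sync behavior described above can be sketched as a comparison between two snapshots of content hashes. This is an illustration under stated assumptions, not mgrep's actual code: the function names are hypothetical, files are modeled as an in-memory dictionary, and `fnmatch` patterns are only a rough stand-in for real .gitignore semantics.

```python
import fnmatch
import hashlib


def is_ignored(path, patterns):
    """Simplified ignore check. Unlike .gitignore, '*' here also matches '/'."""
    return any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)


def snapshot(files, ignore_patterns):
    """Map each non-ignored path to a SHA-256 hash of its content."""
    return {
        path: hashlib.sha256(content).hexdigest()
        for path, content in files.items()
        if not is_ignored(path, ignore_patterns)
    }


def plan_sync(previous, current):
    """Return (paths to upload, paths to delete) between two snapshots."""
    upload = sorted(p for p, digest in current.items() if previous.get(p) != digest)
    delete = sorted(p for p in previous if p not in current)
    return upload, delete


patterns = ["*.log"]  # hypothetical ignore rule
before = snapshot(
    {"src/app.py": b"print('v1')", "README.md": b"# demo", "debug.log": b"noise"},
    patterns,
)
after = snapshot(
    {"src/app.py": b"print('v2')", "src/util.py": b"x = 1", "debug.log": b"more"},
    patterns,
)
upload, delete = plan_sync(before, after)
```

Hashing content rather than comparing timestamps means an unchanged file is never uploaded twice, which matches the stated goal of avoiding redundant uploads.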
Search queries, executed via mgrep, retrieve top-k matches from the indexed store. By default, Mixedbread's reranking mechanism is applied to these results to enhance relevance, though it can be disabled. Results include contextual hints such as relative paths, line ranges for text, or page numbers for PDFs, providing a skim-friendly experience.
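A two-stage pipeline of this kind (take the top-k matches, then optionally reorder them with a second scorer) might look like the sketch below. The `Match` type, the scores, and the lookup-table reranker are hypothetical; a real reranker would be a model that scores each query and document pair.

```python
from dataclasses import dataclass


@dataclass
class Match:
    path: str               # relative path of the matching file
    location: str           # e.g. "lines 10-24" for text or "page 3" for a PDF
    retrieval_score: float  # first-stage similarity score


def top_k(matches, k):
    """Keep the k matches with the highest first-stage score."""
    return sorted(matches, key=lambda m: m.retrieval_score, reverse=True)[:k]


def search(matches, k, reranker=None):
    """Shortlist by retrieval score, then reorder if a reranker is supplied."""
    shortlist = top_k(matches, k)
    if reranker is None:  # reranking disabled
        return shortlist
    return sorted(shortlist, key=reranker, reverse=True)


matches = [
    Match("src/auth.py", "lines 10-24", 0.81),
    Match("docs/guide.pdf", "page 3", 0.78),
    Match("src/session.py", "lines 40-62", 0.74),
    Match("src/billing.py", "lines 1-9", 0.30),
]

# Hypothetical second-stage scores standing in for a reranking model.
RERANK_SCORES = {"src/auth.py": 0.60, "docs/guide.pdf": 0.20, "src/session.py": 0.95}

plain = search(matches, k=3)
reranked = search(matches, k=3, reranker=lambda m: RERANK_SCORES[m.path])
```

Reranking only reorders the shortlist here, so a match that misses the top-k cut cannot be recovered by the second stage.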
mgrep integrates a web search capability, allowing users to query the internet alongside their local files. This feature queries the mixedbread/web store and merges results based on relevance, providing a unified search experience. The --answer flag can be used to generate concise summaries from search results, leveraging large language models (LLMs).
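Merging two result lists by relevance can be sketched as follows. The source does not say how mgrep combines scores, so this assumes local and web scores are directly comparable; the function name, paths, URLs, and scores are all made up for illustration.

```python
import heapq


def merge_by_relevance(local_results, web_results, limit):
    """Combine (reference, score) pairs from both sources, best score first."""
    tagged = [(score, "local", ref) for ref, score in local_results]
    tagged += [(score, "web", ref) for ref, score in web_results]
    best = heapq.nlargest(limit, tagged, key=lambda item: item[0])
    return [(source, ref, score) for score, source, ref in best]


# Hypothetical results; scores are assumed to share one scale.
local = [("src/retry.py", 0.82), ("docs/http.pdf", 0.55)]
web = [("https://example.com/backoff", 0.74), ("https://example.com/timeouts", 0.31)]

merged = merge_by_relevance(local, web, limit=3)
```

Tagging each result with its source keeps the origin visible after the lists are interleaved.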
A key advantage of mgrep for coding agents is that it reduces token usage while maintaining or improving performance. Benchmarks show that mgrep + Claude Code can use roughly half as many tokens as grep-based workflows: semantic search finds relevant snippets in fewer queries, so the LLM spends its capacity on reasoning rather than scanning irrelevant code. mgrep offers assisted installation commands for several agents (e.g., Claude Code, OpenCode, Codex, Factory Droid) to simplify integration.
Configuration options for mgrep include the maximum file size and maximum file count for uploads. They can be set via CLI flags, environment variables, a local configuration file (.mgreprc.yaml), or a global one (~/.config/mgrep/config.yaml), with a fixed precedence order: CLI > environment variables > local config > global config > defaults. Authentication uses either a browser-based login flow or an API key (MXBAI_API_KEY) for CI/CD and headless environments.
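The precedence order can be modeled as a chain of lookups in which the first layer that defines a key wins. The sketch below uses Python's `collections.ChainMap`; the option names and default values are hypothetical and are not taken from mgrep's documentation.

```python
from collections import ChainMap

# Hypothetical defaults; mgrep's real option names and values may differ.
DEFAULTS = {"max_file_size": 1_000_000, "max_file_count": 1000}


def _set_values(layer):
    """Drop unset (None) entries so they do not shadow lower layers."""
    return {key: value for key, value in (layer or {}).items() if value is not None}


def resolve_config(cli=None, env=None, local=None, global_=None):
    """Resolve settings as CLI > env > local config > global config > defaults."""
    layers = [_set_values(cli), _set_values(env), _set_values(local),
              _set_values(global_), DEFAULTS]
    return dict(ChainMap(*layers))  # ChainMap returns the first layer's value


config = resolve_config(
    cli={"max_file_count": 50, "max_file_size": None},  # size flag not passed
    env={"max_file_size": 2_000_000, "max_file_count": 200},
    local={"max_file_size": 500_000},
)
```

In this example the CLI value wins for the file count, while the file size falls through to the environment layer because no CLI value was given.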
In summary, mgrep extends the traditional grep workflow with semantic retrieval, multimodal file support, web search, and agent integrations, giving users and coding agents a faster way to find relevant content by intent rather than by exact string.