GitHub - unclecode/crawl4ai: 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN
Key Points
- Crawl4AI is an open-source, LLM-friendly web crawler and scraper designed to transform web content into clean, structured Markdown, suitable for RAG, AI agents, and data pipelines.
- It offers fast, controllable, and adaptive web extraction with features like LLM-driven and CSS-based structured data extraction, full browser control, and robust handling of dynamic content.
- Recent updates emphasize stability, advanced deployment options with real-time monitoring and Docker integration, crash recovery, and intelligent table extraction, cementing its status as a widely recognized tool.
Crawl4AI is an open-source, LLM-friendly web crawler and scraper designed for robust, large-scale web extraction, primarily converting web content into clean, structured Markdown suitable for Retrieval-Augmented Generation (RAG), autonomous agents, and data pipelines. It emphasizes speed, control, and adaptability, built to bypass common scraping challenges.
Its core methodologies and features include:
- LLM-Ready Output Generation:
- Clean Markdown: Produces well-formatted Markdown with accurate structure (headings, tables, code).
- Fit Markdown: Employs heuristic-based filtering, specifically a `PruningContentFilter` with parameters like `threshold` and `min_word_threshold`, to remove noisy or irrelevant content, making the output more concise and AI-friendly.
- BM25 Filtering: Utilizes the BM25 algorithm (`BM25ContentFilter`) for relevance-based content extraction, allowing users to focus on information related to a `user_query`.
- Citations and References: Automatically converts internal and external page links into numbered reference lists within the Markdown.
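The BM25 relevance filtering described above can be sketched in a few lines. The scorer below is a generic Okapi BM25 implementation for illustration — it is not Crawl4AI's `BM25ContentFilter`, and the `k1`/`b` defaults are the textbook values rather than the library's:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Higher scores mean the document is more relevant to the query,
    which is how relevance-based content filtering picks chunks.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    terms = query.lower().split()
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for doc in tokenized if t in doc) for t in terms}
    scores = []
    for doc in tokenized:
        freqs = Counter(doc)
        score = 0.0
        for t in terms:
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            f = freqs[t]
            # Term-frequency saturation (k1) and length normalization (b).
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

A filter in this style would keep only the chunks whose score clears some relevance cutoff for the user's query.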
- Structured Data Extraction:
- LLM-Driven Extraction: Supports integration with various Large Language Models (open-source and proprietary, via the LiteLLM library) for schema-based structured data extraction. The `LLMExtractionStrategy` allows defining `instruction` prompts and an output `schema` (a Pydantic `BaseModel`), with `input_format` options such as "html", "markdown", or "fit_markdown". It includes configurable rate limiting via `backoff_base_delay`, `backoff_max_attempts`, and `backoff_exponential_factor` for robust API interaction.
- Chunking Strategies: Implements content chunking (topic-based, regex, sentence-level) for processing large documents, often combined with `LLMTableExtraction` using `enable_chunking`, `chunk_token_threshold`, and `overlap_threshold` to handle massive tables by processing them in manageable segments while maintaining context.
- Cosine Similarity: Leverages cosine similarity for semantic content filtering, finding relevant content chunks based on user queries.
- CSS-Based Extraction: Provides fast, schema-based data extraction using `JsonCssExtractionStrategy` with CSS selectors and XPath.
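The overlap-based chunking idea behind `chunk_token_threshold` and `overlap_threshold` can be illustrated with a minimal sketch. The function below is hypothetical and works on a pre-tokenized list, whereas the library applies the same principle to documents and large tables:

```python
def chunk_tokens(tokens, chunk_token_threshold=512, overlap_threshold=64):
    """Split a token list into chunks of at most chunk_token_threshold tokens,
    repeating the trailing overlap_threshold tokens at the start of the next
    chunk so each segment keeps some context from its predecessor."""
    if chunk_token_threshold <= overlap_threshold:
        raise ValueError("chunk size must exceed the overlap")
    chunks, start = [], 0
    step = chunk_token_threshold - overlap_threshold
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_token_threshold])
        if start + chunk_token_threshold >= len(tokens):
            break  # last chunk reached the end of the input
        start += step
    return chunks
```

Each chunk would then be sent to the LLM separately, and the overlapping tokens let the model resolve rows or sentences that straddle a chunk boundary.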
- Advanced Browser Integration and Control:
- Managed Browser Pools: Utilizes Playwright for headless and headful browser automation, maintaining browser pools (permanent, hot, cold tiers) with page pre-warming for performance.
- Browser Profiler & Session Management: Supports creating and managing persistent user profiles (`user_data_dir`) to preserve authentication states, cookies, and settings, enabling multi-step crawling without re-authentication.
- Undetected Browser Support: Includes an undetected-browser option, using `extra_args` such as `--disable-blink-features=AutomationControlled` and `--disable-web-security` to mimic real user behavior and evade sophisticated bot-detection systems like Cloudflare or Akamai.
- Dynamic Content Handling: Executes JavaScript (`js_code`) and waits for asynchronous operations, simulates scrolling (Full-Page Scanning), and handles lazy-loaded elements to ensure complete content capture.
- Customization: Offers extensive control over browser parameters, including headers, cookies, user agents, and dynamic viewport adjustment. Supports various proxy configurations.
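As an illustration of the tiered-pool idea (permanent/hot/cold tiers with pre-warming), here is a hypothetical sketch; the class names and structure are invented for clarity and are not Crawl4AI's actual internals:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class PooledBrowser:
    tier: str        # "permanent", "hot", or "cold"
    prewarmed: bool  # whether pages were opened ahead of time

class BrowserPool:
    """Illustrative tiered pool: hand out the warmest available browser
    first, falling back to colder tiers only when necessary."""
    ORDER = ("permanent", "hot", "cold")

    def __init__(self):
        self.tiers = {t: deque() for t in self.ORDER}

    def add(self, browser):
        self.tiers[browser.tier].append(browser)

    def acquire(self):
        # Warmer tiers amortize browser startup cost across requests.
        for tier in self.ORDER:
            if self.tiers[tier]:
                return self.tiers[tier].popleft()
        return None  # caller would launch a fresh (cold) browser here
```

The design point is simply that serving pre-warmed browsers first hides Playwright's launch and page-creation latency from the crawl itself.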
- Crawling and Scraping Mechanisms:
- Dynamic Crawling: Executes JavaScript and waits for page load events to extract dynamically generated content.
- Comprehensive Link Extraction: Identifies and extracts internal, external, and embedded iframe links.
- Customizable Hooks: Provides a powerful `hooks_to_string` utility for defining Python function-based hooks at 8 key points in the crawling lifecycle (e.g., `on_page_context_created`, `before_goto`). These hooks allow fine-grained control over browser behavior (e.g., blocking resources, modifying requests).
- Adaptive Crawling: An `AdaptiveCrawler` can learn and adapt to website patterns based on a `confidence_threshold`, `max_depth`, `max_pages`, and a `statistical` strategy to explore only relevant content paths.
- Crash Recovery: For deep crawls, it provides `on_state_change` callbacks for real-time state persistence and a `resume_state` parameter to continue from saved checkpoints, supporting BFS, DFS, and Best-First strategies.
- Prefetch Mode: A `CrawlerRunConfig` option enables fast URL discovery by skipping Markdown generation, extraction, and media processing, offering a 5-10x speedup for two-phase crawling.
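The checkpoint-and-resume pattern behind `on_state_change` and `resume_state` can be sketched as a BFS loop that reports its frontier after every page. This is an illustrative standalone function, not the library's deep-crawl implementation:

```python
from collections import deque

def bfs_crawl(start_url, get_links, max_pages=100,
              resume_state=None, on_state_change=None):
    """Breadth-first deep crawl with crash recovery: after every page,
    the current frontier and visited set are handed to on_state_change,
    and a saved state can be passed back in as resume_state."""
    if resume_state:
        frontier = deque(resume_state["frontier"])
        visited = set(resume_state["visited"])
    else:
        frontier, visited = deque([start_url]), set()
    order = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Enqueue newly discovered links (get_links stands in for a fetch).
        frontier.extend(u for u in get_links(url) if u not in visited)
        if on_state_change:
            # Persisting this dict is what makes resumption possible.
            on_state_change({"frontier": list(frontier),
                             "visited": list(visited)})
    return order
```

Swapping the `deque` for a stack or a priority queue yields the DFS and Best-First variants the section mentions; the checkpoint dict works the same way in all three.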
- Deployment and Infrastructure:
- Dockerized Setup: Optimized Docker images with a FastAPI server provide easy deployment, supporting multi-architecture (AMD64/ARM64).
- Monitoring: Features a real-time monitoring dashboard and a comprehensive REST API for programmatic access to system health, request tracking, and browser pool status, with WebSocket streaming for live updates.
- Scalability: Designed for mass-scale production with optimized server performance and JWT token authentication for API security.
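A typical containerized launch might look like the following sketch; the image name, tag, port, and health-endpoint path are common defaults assumed for illustration and should be checked against the project's deployment docs:

```shell
# Start the Crawl4AI server in Docker (image/port are assumptions).
# --shm-size matters because headless Chromium uses shared memory.
docker run -d \
  --name crawl4ai \
  --shm-size=1g \
  -p 11235:11235 \
  unclecode/crawl4ai:latest

# Probe the REST API for system health (endpoint path assumed).
curl http://localhost:11235/health
```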
- Utility Features:
- Caching: Caches data to improve speed and avoid redundant fetches.
- Memory Monitoring: The `MemoryMonitor` tracks and reports peak memory usage and efficiency, providing insights for optimization.
- Multi-URL Configuration: Allows defining different `CrawlerRunConfig` instances (e.g., `cache_mode`, `markdown_generator_options`) with `url_matcher` patterns (strings or lambda functions) for highly customized crawling of diverse URL sets in a single batch.
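The per-URL dispatch that `url_matcher` enables (a string pattern or a lambda selecting a config) can be sketched as follows; `pick_config` and the glob-style string matching are assumptions for illustration, not the library's exact semantics:

```python
import fnmatch

def pick_config(url, configs, default=None):
    """Return the first config whose matcher accepts the URL.
    A matcher is either a glob-style string or a callable, mirroring
    the string-or-lambda url_matcher idea described above."""
    for matcher, config in configs:
        if callable(matcher):
            if matcher(url):
                return config
        elif fnmatch.fnmatch(url, matcher):
            return config
    return default

# Hypothetical per-pattern settings, first match wins.
configs = [
    ("*/docs/*", {"cache_mode": "bypass"}),
    (lambda u: u.endswith(".pdf"), {"cache_mode": "enabled"}),
]
```

Ordering the pairs from most to least specific keeps the dispatch predictable, since the first matching entry wins.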