GitHub - unclecode/crawl4ai: πŸš€πŸ€– Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN

unclecode Β· GitHub Β· 2026.02.22 Β· by Choi Se-young
#Data Extraction #LLM #RAG #Scraper #Web Crawler

Key Points

  1. Crawl4AI is an open-source, LLM-friendly web crawler and scraper designed to transform web content into clean, structured Markdown, suitable for RAG, AI agents, and data pipelines.
  2. It offers fast, controllable, and adaptive web extraction with features like LLM-driven and CSS-based structured data extraction, full browser control, and robust handling of dynamic content.
  3. Recent updates emphasize stability, advanced deployment options with real-time monitoring and Docker integration, crash recovery, and intelligent table extraction, cementing its status as a widely recognized tool.

Crawl4AI is an open-source, LLM-friendly web crawler and scraper designed for robust, large-scale web extraction, primarily converting web content into clean, structured Markdown suitable for Retrieval-Augmented Generation (RAG), autonomous agents, and data pipelines. It emphasizes speed, control, and adaptability, built to bypass common scraping challenges.

Its core methodologies and features include:

  1. LLM-Ready Output Generation:
    • Clean Markdown: Produces well-formatted Markdown with accurate structure (headings, tables, code).
    • Fit Markdown: Employs heuristic-based filtering, specifically using a PruningContentFilter with parameters like threshold and min_word_threshold, to remove noisy or irrelevant content, making the output more concise and AI-friendly.
    • BM25 Filtering: Utilizes the BM25 algorithm (BM25ContentFilter) for relevance-based content extraction, allowing users to focus on information related to a user_query.
    • Citations and References: Automatically converts internal and external page links into numbered reference lists within the Markdown.
  2. Structured Data Extraction:
    • LLM-Driven Extraction: Supports schema-based structured data extraction with a wide range of Large Language Models, both open-source and proprietary, via the LiteLLM library. The LLMExtractionStrategy allows defining instruction prompts and an output schema (a Pydantic BaseModel), with input_format options like "html", "markdown", or "fit_markdown". It includes configurable rate limiting with backoff_base_delay, backoff_max_attempts, and backoff_exponential_factor for robust API interaction.
    • Chunking Strategies: Implements content chunking (topic-based, regex, sentence-level) for processing large documents, often combined with LLMTableExtraction using enable_chunking, chunk_token_threshold, and overlap_threshold to handle massive tables by processing them in manageable segments while maintaining context.
    • Cosine Similarity: Leverages cosine similarity for semantic content filtering, finding relevant content chunks based on user queries.
    • CSS-Based Extraction: Provides fast, schema-based data extraction using JsonCssExtractionStrategy with CSS selectors and XPath.
  3. Advanced Browser Integration and Control:
    • Managed Browser Pools: Utilizes Playwright for headless and headful browser automation, maintaining browser pools (permanent, hot, cold tiers) with page pre-warming for performance.
    • Browser Profiler & Session Management: Supports creating and managing persistent user profiles (user_data_dir) to preserve authentication states, cookies, and settings, enabling multi-step crawling without re-authentication.
    • Undetected Browser Support: Includes a browser_type="undetected" option, using extra_args such as --disable-blink-features=AutomationControlled and --disable-web-security to mimic real user behavior and evade sophisticated bot-detection systems like Cloudflare or Akamai.
    • Dynamic Content Handling: Executes JavaScript (js_code) and waits for asynchronous operations, simulates scrolling (Full-Page Scanning), and handles lazy-loaded elements to ensure complete content capture.
    • Customization: Offers extensive control over browser parameters, including headers, cookies, user agents, and dynamic viewport adjustment. Supports various proxy configurations.
  4. Crawling and Scraping Mechanisms:
    • Dynamic Crawling: Executes JavaScript and waits for page load events to extract dynamically generated content.
    • Comprehensive Link Extraction: Identifies and extracts internal, external, and embedded iframe links.
    • Customizable Hooks: Provides a powerful hooks_to_string utility for defining Python function-based hooks at 8 key points in the crawling lifecycle (e.g., on_page_context_created, before_goto). These hooks allow for fine-grained control over browser behavior (e.g., blocking resources, modifying requests).
    • Adaptive Crawling: An AdaptiveCrawler can learn and adapt to website patterns based on a confidence_threshold, max_depth, max_pages, and a statistical strategy to explore only relevant content paths.
    • Crash Recovery: For deep crawls, it provides on_state_change callbacks for real-time state persistence and a resume_state parameter to continue from saved checkpoints, supporting BFS, DFS, and Best-First strategies.
    • Prefetch Mode: A prefetch=True CrawlerRunConfig option enables fast URL discovery by skipping Markdown generation, extraction, and media processing, offering a 5-10x speedup for two-phase crawling.
  5. Deployment and Infrastructure:
    • Dockerized Setup: Optimized Docker images with a FastAPI server provide easy deployment, supporting multi-architecture (AMD64/ARM64).
    • Monitoring: Features a real-time monitoring dashboard and a comprehensive REST API for programmatic access to system health, request tracking, and browser pool status, with WebSocket streaming for live updates.
    • Scalability: Designed for mass-scale production with optimized server performance and JWT token authentication for API security.
  6. Utility Features:
    • Caching: Caches data to improve speed and avoid redundant fetches.
    • Memory Monitoring: The MemoryMonitor tracks and reports peak memory usage and efficiency, providing insights for optimization.
    • Multi-URL Configuration: Allows defining different CrawlerRunConfig instances (e.g., cache_mode, markdown_generator_options) with url_matcher patterns (string or lambda functions) for highly customized crawling of diverse URL sets in a single batch.
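To make the ideas above concrete, the following sketches illustrate the underlying techniques in plain Python. The Fit Markdown step relies on heuristic pruning; this is a minimal sketch of a min_word_threshold rule only, not Crawl4AI's actual PruningContentFilter, which uses additional heuristics.

```python
# Heuristic pruning sketch: drop text blocks whose word count falls below a
# minimum, keeping only content-dense blocks. Illustrative only -- not
# Crawl4AI's PruningContentFilter implementation.

def prune_blocks(blocks, min_word_threshold=5):
    """Keep only blocks containing at least min_word_threshold words."""
    return [b for b in blocks if len(b.split()) >= min_word_threshold]

blocks = [
    "Subscribe now!",  # boilerplate: 2 words, dropped
    "Crawl4AI converts web pages into clean structured Markdown for RAG pipelines.",
]
print(prune_blocks(blocks))  # only the content-dense block survives
```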
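The BM25 relevance filtering described above (BM25ContentFilter with a user_query) is based on the standard Okapi BM25 ranking function, which can be sketched with the standard library; this is the generic algorithm, not Crawl4AI's implementation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 function."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N  # average document length
    df = Counter()                              # document frequency per term
    for d in tokenized:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

A content chunk that shares no terms with the query scores zero and can be filtered out.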
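The citations-and-references feature converts inline links into a numbered reference list; a minimal regex-based sketch of that idea (not Crawl4AI's generator) looks like this:

```python
import re

def add_citations(markdown):
    """Replace inline [text](url) links with numbered citations and append
    a reference list, de-duplicating repeated URLs."""
    refs = []
    def repl(m):
        text, url = m.group(1), m.group(2)
        if url not in refs:
            refs.append(url)
        return f"{text}[{refs.index(url) + 1}]"
    body = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", repl, markdown)
    if refs:
        body += "\n\nReferences:\n" + "\n".join(
            f"[{i + 1}] {u}" for i, u in enumerate(refs))
    return body
```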
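The rate-limiting parameters mentioned for LLM extraction (backoff_base_delay, backoff_max_attempts, backoff_exponential_factor) suggest standard exponential backoff; the formula below is an assumption for illustration, not taken from the source.

```python
def backoff_delays(base_delay=1.0, max_attempts=5, exponential_factor=2.0):
    """Delay before each retry attempt: base_delay * factor ** attempt.
    Assumed standard exponential-backoff formula, for illustration only."""
    return [base_delay * exponential_factor ** i for i in range(max_attempts)]

print(backoff_delays(1.0, 4, 2.0))  # [1.0, 2.0, 4.0, 8.0]
```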
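Chunking with chunk_token_threshold and overlap_threshold amounts to sliding a window over the token stream so that each chunk repeats the tail of the previous one for context. A minimal sketch of that windowing (hypothetical helper, not Crawl4AI's chunker):

```python
def chunk_tokens(tokens, chunk_token_threshold=100, overlap=10):
    """Split a token list into chunks of at most chunk_token_threshold tokens,
    repeating the last `overlap` tokens of each chunk at the start of the next
    so context carries across chunk boundaries. Requires overlap < threshold."""
    chunks, start = [], 0
    step = chunk_token_threshold - overlap
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_token_threshold])
        start += step
    return chunks
```

Each chunk can then be handed to the LLM independently, e.g. when a table is too large to fit one prompt.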
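The cosine-similarity filter finds chunks semantically close to a query; over bag-of-words vectors the measure itself is simple (Crawl4AI would typically use embeddings, but the math is the same):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two texts using bag-of-words term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```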
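The lifecycle hooks (e.g., before_goto, on_page_context_created) follow a registry-and-dispatch pattern; this is a hypothetical sketch of that pattern, not Crawl4AI's hook machinery:

```python
# Hypothetical lifecycle-hook registry: user callbacks attached to named hook
# points and fired in registration order, in the spirit of before_goto etc.
class HookRegistry:
    def __init__(self):
        self._hooks = {}

    def on(self, point, fn):
        """Register a callback for a named hook point."""
        self._hooks.setdefault(point, []).append(fn)

    def fire(self, point, context):
        """Run every callback registered for the hook point, in order."""
        for fn in self._hooks.get(point, []):
            fn(context)
        return context

hooks = HookRegistry()
# e.g., inject a header before navigation (illustrative callback)
hooks.on("before_goto",
         lambda ctx: ctx.setdefault("headers", {}).update({"X-Test": "1"}))
```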
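Crash recovery for deep crawls combines a BFS frontier with state snapshots (on_state_change) that a later run can resume from (resume_state). The sketch below shows the checkpoint/resume idea with a toy link graph; it is not Crawl4AI's deep-crawl engine.

```python
from collections import deque

def bfs_crawl(start_url, get_links, max_pages=10, resume_state=None,
              on_state_change=None):
    """Breadth-first traversal with checkpointing: frontier and visited set
    are snapshotted after each page so a crashed crawl can resume mid-run."""
    state = resume_state or {"frontier": [start_url], "visited": []}
    frontier = deque(state["frontier"])
    visited = set(state["visited"])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)
        state = {"frontier": list(frontier), "visited": sorted(visited)}
        if on_state_change:
            on_state_change(state)  # persist checkpoint (e.g., to disk)
    return state

graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
partial = bfs_crawl("a", graph.get, max_pages=2)     # simulate interruption
resumed = bfs_crawl("a", graph.get, resume_state=partial)
print(resumed["visited"])
```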
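Finally, the multi-URL configuration idea, matching each URL against a string pattern or a callable to pick a per-URL config, can be sketched with the standard library; the helper and config keys here are illustrative, not Crawl4AI's API.

```python
from fnmatch import fnmatchcase

def pick_config(url, config_table, default):
    """Return the first config whose matcher accepts the URL. A matcher is a
    glob string or a callable url -> bool, echoing the string-or-lambda
    url_matcher idea. Illustrative helper, not Crawl4AI's dispatcher."""
    for matcher, config in config_table:
        matched = matcher(url) if callable(matcher) else fnmatchcase(url, matcher)
        if matched:
            return config
    return default

table = [
    ("*/docs/*", {"cache_mode": "bypass"}),              # glob-string matcher
    (lambda u: u.endswith(".pdf"), {"cache_mode": "enabled"}),  # callable matcher
]
print(pick_config("https://site.com/docs/intro", table, {}))
```

First match wins, so more specific patterns should come earlier in the table.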