GitHub - philschmid/clipper.js: HTML to Markdown converter and crawler.
Key Points
- Clipper.js is a Node.js command-line tool designed to extract and convert web page content into Markdown, serving as a terminal-based alternative to browser clipping extensions.
- Leveraging Mozilla Readability and Turndown, it supports clipping from URLs, HTML files, or directories, and also includes a crawling feature powered by Playwright and Crawlee for comprehensive site content capture.
- Beyond direct web clipping, Clipper.js can facilitate PDF-to-Markdown conversion by first converting PDFs to HTML, making it a versatile tool for digital content archival and note-taking.
Clipper is a Node.js command-line interface (CLI) tool designed for extracting and converting web page content into Markdown. It offers functionalities analogous to browser extensions like Evernote Web Clipper or Notion Web Clipper, but operates entirely within the terminal environment, eliminating the need for browser extensions or external account registrations.
Clipper's core functionality is split into two primary operations: `clip` and `crawl`.
Clipping Methodology
The `clip` command focuses on content extraction and conversion from individual or multiple HTML sources:
- Input Acquisition: The tool accepts content from three distinct sources:
  - A Uniform Resource Locator (URL) specified via the `-u` flag, enabling direct processing of live web pages.
  - A local HTML file, identified by the `-i <file>` flag.
  - A directory containing multiple HTML files, specified by `-i <directory>`, which prompts the tool to process all HTML files within that directory.
- Content Extraction (Readability): For web pages or HTML files, Clipper first employs Mozilla's Readability library. Readability is a JavaScript implementation of a content extraction algorithm designed to identify and parse the main article-like content from an HTML document. It heuristically analyzes the Document Object Model (DOM) to remove boilerplate elements such as navigation, advertisements, and footers, isolating the primary textual and media content (e.g., headings, paragraphs, images, lists) relevant to an article. This process aims to provide a clean, focused HTML fragment.
- Markdown Conversion (Turndown): The cleaned HTML output from Readability is then piped to Turndown, a JavaScript library specifically engineered to convert HTML into Markdown syntax. It meticulously maps HTML tags and structures (e.g., `<h1>` to `#`, `<p>` to newline-separated text, `<a href="...">text</a>` to `[text](url)`) to their corresponding Markdown equivalents, ensuring proper formatting and readability in the Markdown output.
- Output Generation: The converted content can be generated in two formats:
  - Markdown (default): Output as a `.md` file, suitable for archival or note-taking.
  - JSON: Output as a `.json` file, or for directory inputs, a `.jsonl` (JSON Lines) file, where each line represents a JSON object corresponding to a clipped document. This is particularly useful for building datasets.
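The extract-then-convert pipeline above can be illustrated with a deliberately minimal sketch. This is not the actual Readability or Turndown code; it is a toy converter built on Python's standard-library `html.parser` that shows the kind of tag-to-Markdown mapping Turndown performs, for a small handful of tags:

```python
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Toy converter for a small HTML subset: headings, paragraphs, links."""

    def __init__(self):
        super().__init__()
        self.out = []     # accumulated Markdown fragments
        self.href = None  # href of the currently open <a> tag, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # <h1> -> "# ", <h2> -> "## ", <h3> -> "### "
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            # <a href="url">text</a> -> [text](url)
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()


print(html_to_markdown('<h1>Title</h1><p>See <a href="https://example.com">docs</a>.</p>'))
```

The real Turndown handles far more (lists, emphasis, images, code blocks, nested structures, and customizable rules), and in Clipper's pipeline Readability's boilerplate removal runs before this conversion step.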
Crawling Methodology
The `crawl` command extends the clipping functionality to systematically process multiple pages from a given website:
- Web Crawling (Crawlee): This functionality leverages Crawlee, a robust web scraping and crawling library. Crawlee handles the intricacies of navigating websites, managing HTTP requests, respecting rate limits, and handling potential errors. It is responsible for identifying and fetching web pages based on specified criteria.
- URL and Pattern Matching: The crawling process is initiated with a base URL (`-u`) and can be refined with a glob pattern (`-g`). The glob pattern allows users to define specific URL structures that should be included in the crawl, enabling focused data collection from a large website (e.g., `https://example.com/docs/**` to crawl all pages under the `/docs/` path). Crawlee then traverses the site, identifying links that match the provided pattern.
- Batch Clipping: For each page successfully fetched by Crawlee that matches the glob pattern, Clipper applies the aforementioned clipping methodology (Readability for content extraction, Turndown for Markdown conversion) to its HTML content.
- Output: The results of a crawl are aggregated into a `dataset.jsonl` file by default. Each line in this file is a JSON object containing the clipped content along with any associated metadata. The documentation includes a warning about the resource-intensive nature of crawling and its potential impact on website owners.
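A rough sketch of the filter-and-aggregate flow, assuming a hypothetical list of discovered URLs: the `matches` helper, the URL list, and the placeholder Markdown are illustrative stand-ins, and Crawlee's real glob handling, request queueing, and error management are far more involved.

```python
import fnmatch
import json


def matches(url: str, pattern: str) -> bool:
    # fnmatch's "*" matches any characters, including "/", so a pattern like
    # "https://example.com/docs/**" effectively acts as a recursive glob here.
    return fnmatch.fnmatchcase(url, pattern)


# Hypothetical crawl frontier: only URLs matching the glob get clipped.
frontier = [
    "https://example.com/docs/intro",
    "https://example.com/docs/guide/setup",
    "https://example.com/blog/announcement",
]
pattern = "https://example.com/docs/**"
clipped = [u for u in frontier if matches(u, pattern)]

# Aggregate results as JSON Lines: one JSON object per clipped page,
# mirroring the shape of a dataset.jsonl output (content is a stand-in).
lines = [json.dumps({"url": u, "markdown": "# placeholder"}) for u in clipped]
jsonl = "\n".join(lines) + "\n"
print(jsonl)
```

One line per document keeps the output append-friendly and streamable, which is why JSON Lines is a common choice for crawl datasets.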
Alternative Use Cases
Clipper also integrates into a workflow for converting PDF documents to Markdown. This is achieved by first converting the PDF to HTML using an external tool such as Poppler's `pdftohtml` utility, and then using Clipper to convert the resulting HTML to Markdown. For example: `pdftohtml -c -s -noframes test.pdf test.html` followed by `clipper clip -i test.html`.
Technical Details and Ecosystem
Clipper is distributed via npm and licensed under Apache-2.0. Its local development workflow follows standard Node.js practices: `npm install` for dependencies, `npm run test` for command-line validation, `npm run build` for production builds, and `npm install -g .` for local symlinking. The tool explicitly credits its foundational open-source libraries, Mozilla Readability, Turndown, and Crawlee, which collectively form its core processing pipeline for efficient and reliable content extraction and conversion.