open-index/hacker-news
Key Points
- This dataset provides the complete and continuously updated archive of Hacker News content, including stories, comments, polls, and job postings, spanning from 2006 to the present.
- Organized into monthly Parquet files with real-time 5-minute updates for current activity, it offers a comprehensive, live mirror of the site, currently totaling over 47 million items.
- The dataset's standard Parquet structure allows efficient querying and analysis directly from Hugging Face via tools like DuckDB, the `datasets` library, pandas, and `huggingface_hub`.
This dataset card describes the "Hacker News - Complete Archive," a continuously updated collection of every item submitted to Hacker News since the site's launch in October 2006. The archive currently contains over 47.4 million items: stories, comments, Ask HN posts, Show HN posts, job postings, and polls. Hacker News, operated by Y Combinator, is a key platform for discussion within the technology community.
The core methodology for maintaining this dataset emphasizes real-time updates and data integrity. New Hacker News items are fetched from the source every 5 minutes and committed as individual Parquet files under a `today/` directory, organized by `YYYY/MM/DD/HH/MM.parquet` paths that each represent a 5-minute block of activity. To ensure long-term consistency and authoritative completeness, at midnight UTC each day the entire current month's data is refetched from the Hacker News API and consolidated into a single Parquet file named `YYYY/MM.parquet`. Following this consolidation, the individual 5-minute block files for the preceding day are removed from the `today/` directory. This dual-layer approach provides immediate access to recent activity while maintaining a clean, robust, and complete historical archive.

Complementing the data, `stats.csv` and `stats_today.csv` track metadata (item counts, ID ranges, file sizes, fetch durations, and commit timestamps) for each monthly file and each 5-minute block, respectively, allowing verification of pipeline progress and data completeness.
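The dual-layer layout above can be sketched as two path helpers. This is an illustrative sketch only: the path formats come from the card, but the helper functions are not part of the dataset's pipeline.

```python
from datetime import datetime, timezone

def block_path(ts: datetime) -> str:
    """Path of the 5-minute block file covering timestamp ts (illustrative helper)."""
    minute = ts.minute - ts.minute % 5  # floor to the 5-minute boundary
    return f"today/{ts.year:04d}/{ts.month:02d}/{ts.day:02d}/{ts.hour:02d}/{minute:02d}.parquet"

def month_path(ts: datetime) -> str:
    """Path of the consolidated monthly file (illustrative helper)."""
    return f"{ts.year:04d}/{ts.month:02d}.parquet"

ts = datetime(2024, 3, 7, 14, 23, tzinfo=timezone.utc)
print(block_path(ts))  # today/2024/03/07/14/20.parquet
print(month_path(ts))  # 2024/03.parquet
```

An item created at 14:23 UTC thus lands in the `14/20.parquet` block until the nightly consolidation folds it into `2024/03.parquet`.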
The dataset is structured with the following fields:
- `id` (uint32): a unique identifier for each item.
- `deleted` (uint8): whether the item has been deleted (0 = not deleted, 1 = deleted).
- `type` (int8): the item's category (1 = story, 2 = comment, 3 = poll, 4 = poll option, 5 = job).
- `by` (string): the username of the item's author.
- `time`: the UTC timestamp of the item's creation.
- `text` (string): the content of the item (e.g., comment body, story text).
- `dead` (uint8): whether the item is "dead" (0 = not dead, 1 = dead).
- `parent` (uint32): the ID of the parent item for comments or poll options.
- `poll` (uint32): the ID of the poll to which a poll option belongs.
- `kids` (list of uint32): IDs of direct child items (e.g., comments on a story).
- `url` (string): the URL for story items.
- `score` (int32): the score (number of upvotes) of the item.
- `title` (string): the title of the item (for stories or polls).
- `parts` (list of uint32): IDs of the poll option items belonging to a poll.
- `descendants` (int32): the total number of comments for a story item.
- `words` (list of string): words, likely extracted from the item's `text` or `title` for analytical purposes.
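A short sketch of how rows under this schema might be interpreted, using hand-written synthetic items (the field names and type codes come from the schema above; the item values are made up for illustration):

```python
# Mapping of the integer `type` codes documented in the schema.
TYPE_CODES = {1: "story", 2: "comment", 3: "poll", 4: "pollopt", 5: "job"}

# Synthetic rows shaped like dataset items (values are illustrative, not real data).
items = [
    {"id": 100, "type": 1, "by": "alice", "score": 57,
     "title": "Example story", "kids": [101, 102]},
    {"id": 101, "type": 2, "by": "bob", "parent": 100, "text": "A reply."},
]

# Resolve each item's kind and count its direct children via `kids`.
for item in items:
    kind = TYPE_CODES[item["type"]]
    n_kids = len(item.get("kids", []))
    print(f'{item["id"]}: {kind} with {n_kids} direct children')
```

Note that `kids` lists only direct children, while `descendants` (when present on a story) counts the whole comment subtree.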
The dataset's organization as standard Parquet files facilitates access and analysis with a range of tools: DuckDB for querying directly from Hugging Face, the Hugging Face `datasets` library for loading specific years or streaming the full history, `huggingface_hub` for selective downloads, and pandas or DuckDB for in-memory analysis.
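Because monthly files follow the `YYYY/MM.parquet` naming scheme, selective downloads reduce to glob patterns over the repository's file listing, in the style of the `allow_patterns` filtering that `huggingface_hub`'s `snapshot_download` supports. The sketch below applies such patterns to an illustrative (not fetched) file list:

```python
from fnmatch import fnmatch

# Illustrative repository listing following the layout described above
# (not fetched from the Hub).
repo_files = [
    "2006/10.parquet",
    "2023/01.parquet",
    "2023/12.parquet",
    "today/2024/03/07/14/20.parquet",
    "stats.csv",
]

def select(files, pattern):
    """Keep files matching a shell-style glob pattern."""
    return [f for f in files if fnmatch(f, pattern)]

print(select(repo_files, "2023/*.parquet"))            # monthly archives for 2023
print(select(repo_files, "today/*/*/*/*/*.parquet"))   # live 5-minute blocks
```

In practice one would pass such patterns to `snapshot_download(repo_id="open-index/hacker-news", repo_type="dataset", allow_patterns=[...])` to fetch only the months of interest.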