GitHub - karpathy/jobs: Analyzing how susceptible every occupation in the US economy is to AI and automation, using data from the Bureau of Labor Statistics
Key Points
- This project comprehensively analyzes the AI exposure of 342 U.S. occupations from the BLS Occupational Outlook Handbook, providing a live interactive treemap visualization.
- Each occupation is scored on a 0-10 AI Exposure scale by an LLM (Gemini Flash), considering direct automation and inherent digital work, with an average exposure score of 5.3/10.
- The treemap visualization displays occupations by employment size (area) and AI exposure (color, green to red), allowing users to explore detailed statistics and LLM rationales.
The karpathy/jobs project analyzes how susceptible every occupation in the US economy is to AI and automation, using data from the Bureau of Labor Statistics (BLS) Occupational Outlook Handbook (OOH). It covers 342 occupations across all sectors; for each one, the source data details job duties, work environment, education, pay, and employment projections.
The core methodology involves a multi-stage data pipeline:
- Scraping: The scrape.py script employs Playwright in non-headless mode (to bypass BLS bot detection) to download the raw HTML for all 342 occupation pages from the BLS website. These raw HTML files are stored in the html/ directory, serving as the primary data source.
- Parsing: The parse_detail.py and process.py scripts utilize BeautifulSoup to convert the raw HTML content into clean Markdown files, which are then stored in the pages/ directory. This step standardizes the textual descriptions of each occupation.
- Tabulation: The make_csv.py script extracts structured metadata from the processed occupation data. This includes fields such as typical pay, required education level, current job count, employment growth outlook, and the Standard Occupational Classification (SOC) code. This structured data is compiled into occupations.csv.
- AI Exposure Scoring: This is the central analytical component, executed by score.py. Each occupation's Markdown description from the pages/ directory is sent to a Large Language Model (LLM), specifically Gemini Flash accessed via OpenRouter, requiring an OPENROUTER_API_KEY. The LLM applies a predefined scoring rubric to assign an "AI Exposure" score ranging from 0 to 10, accompanied by a textual rationale.
  - Scoring Criteria: The AI Exposure score quantifies how much AI is anticipated to reshape an occupation. It considers both:
    - Direct Automation: AI performing tasks traditionally done by humans.
    - Indirect Effects: AI increasing human productivity to such an extent that fewer workers are needed for the same output.
  - A critical signal for high exposure is whether the job's work product is fundamentally digital and can be performed entirely from a computer (e.g., from a home office). Conversely, jobs requiring physical presence, manual dexterity, or real-time human interaction are considered to have a natural barrier against AI exposure.
  - Score Calibration: The 0-10 scale is calibrated with examples:
    - 0-1 (Minimal): Roofers, Janitors, Construction Laborers.
    - 2-3 (Low): Electricians, Plumbers, Nurses Aides, Firefighters.
    - 4-5 (Moderate): Registered Nurses, Retail Workers, Physicians.
    - 6-7 (High): Teachers, Managers, Accountants, Engineers.
    - 8-9 (Very High): Software Developers, Paralegals, Data Analysts, Editors.
    - 10 (Maximum): Medical Transcriptionists.
  - The average exposure across all 342 occupations is calculated as 5.3/10. The results, including scores and rationales, are saved to scores.json.
- Site Data Generation: The build_site_data.py script merges the structured statistics from occupations.csv with the AI exposure scores from scores.json into a compact JSON file (site/data.json). This file is optimized for frontend consumption.
- Website Visualization: An interactive treemap visualization (site/index.html) is built to present the data. In this visualization, the area of each rectangle is proportional to the total employment (number of jobs) for that occupation, and the color indicates its AI exposure on a gradient from green (safe/low exposure) to red (exposed/high exposure). Occupations are grouped by BLS category, and hovering over a rectangle displays a detailed tooltip with pay, job count, growth outlook, education requirements, exposure score, and the LLM-generated rationale.
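The last two pipeline stages (merging statistics with scores, then mapping exposure to a tier and a color) can be illustrated with a minimal Python sketch. It is not the project's actual build_site_data.py: the column names, JSON keys, job counts, rationale strings, and the linear green-to-red interpolation are all assumptions made for illustration, and the inline strings merely stand in for occupations.csv and scores.json.

```python
import csv
import io
import json
from statistics import mean

# Placeholder inputs standing in for occupations.csv and scores.json.
# Column names, keys, job counts, and rationales are invented for illustration.
OCCUPATIONS_CSV = """slug,title,category,jobs
roofers,Roofers,Construction,100
registered-nurses,Registered Nurses,Healthcare,200
software-developers,Software Developers,Computer and IT,300
"""
SCORES_JSON = """[
  {"slug": "roofers", "exposure": 1, "rationale": "Physical, on-site work."},
  {"slug": "registered-nurses", "exposure": 4, "rationale": "Hands-on care."},
  {"slug": "software-developers", "exposure": 9, "rationale": "Fully digital output."}
]"""

# Upper bound of each tier on the 0-10 scale, following the calibration list.
TIERS = [(1, "Minimal"), (3, "Low"), (5, "Moderate"),
         (7, "High"), (9, "Very High"), (10, "Maximum")]


def tier_label(score):
    """Return the calibration tier name for a 0-10 exposure score."""
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    for upper, label in TIERS:
        if score <= upper:
            return label


def exposure_color(score):
    """Interpolate linearly from green (0) to red (10); an assumed palette."""
    t = min(max(score / 10, 0.0), 1.0)
    return f"#{round(255 * t):02x}{round(255 * (1 - t)):02x}00"


def merge_records(csv_text, scores_text):
    """Join CSV rows with their scores on the (hypothetical) slug key."""
    scores = {entry["slug"]: entry for entry in json.loads(scores_text)}
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = scores.get(row["slug"])
        if entry is None:
            continue  # skip occupations that were never scored
        records.append({
            "title": row["title"],
            "category": row["category"],
            "jobs": int(row["jobs"]),  # treemap rectangle area
            "exposure": entry["exposure"],
            "tier": tier_label(entry["exposure"]),
            "color": exposure_color(entry["exposure"]),
            "rationale": entry["rationale"],
        })
    return records


records = merge_records(OCCUPATIONS_CSV, SCORES_JSON)
site_data = json.dumps(records, separators=(",", ":"))  # compact JSON text
average = mean(r["exposure"] for r in records)  # unweighted mean exposure
```

Here each record carries everything a treemap cell and its tooltip would need, and the unweighted mean mirrors the kind of aggregate reported as the average exposure. Whether the project precomputes colors or derives them in the browser is not stated, so treat that choice as one possible design.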
A key output is prompt.md, a consolidated file (approximately 45,000 tokens) designed to be directly pasted into an LLM for analysis. It packages all aggregated statistics, tier breakdowns, exposure by pay/education, BLS growth projections, and all 342 occupations with their scores and rationales, enabling data-grounded conversations about AI's labor market impact without requiring code execution. Key data files include occupations.json (master list of occupations), occupations.csv (summary stats), and scores.json (AI exposure data). The project setup requires uv for dependency management, playwright for scraping, and an OpenRouter API key for LLM access.
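As a rough illustration of how such a consolidated file could be assembled, the following Python sketch renders a handful of scored records into one Markdown text and estimates its size. The record fields, the layout of the output, and the four-characters-per-token heuristic are assumptions; the actual structure of prompt.md and the method used to count its tokens are not described here.

```python
from statistics import mean

# Hypothetical records; job counts and rationales are placeholders.
RECORDS = [
    {"title": "Roofers", "jobs": 100, "exposure": 1,
     "rationale": "Physical, on-site work."},
    {"title": "Registered Nurses", "jobs": 200, "exposure": 4,
     "rationale": "Hands-on care."},
    {"title": "Software Developers", "jobs": 300, "exposure": 9,
     "rationale": "Fully digital output."},
]


def build_prompt(records):
    """Render summary statistics and every occupation as one Markdown text."""
    lines = [
        "# AI exposure of US occupations",
        "",
        f"Occupations: {len(records)}",
        f"Average exposure: {mean(r['exposure'] for r in records):.1f}/10",
        "",
        "## Occupations (most exposed first)",
    ]
    for r in sorted(records, key=lambda r: r["exposure"], reverse=True):
        lines.append(
            f"- {r['title']} (jobs: {r['jobs']}, "
            f"exposure: {r['exposure']}/10): {r['rationale']}"
        )
    return "\n".join(lines) + "\n"


def rough_token_estimate(text):
    """Crude size check using an assumed ~4 characters per token."""
    return len(text) // 4


prompt = build_prompt(RECORDS)
print(prompt)
print("approx. tokens:", rough_token_estimate(prompt))
```

Packing statistics and rationales into a single text file keeps the analysis usable by anyone with access to a chat interface, which matches the stated goal of enabling discussion without running code.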