GitHub - karpathy/jobs: Analyzing how susceptible every occupation in the US economy is to AI and automation, using data from the Bureau of Labor Statistics

karpathy
2026.03.15
GitHub · by 이호민
#AI · #Automation · #Data Analysis · #LLM · #Visualization

Key Points

  1. The project comprehensively analyzes the AI exposure of 342 U.S. occupations from the BLS Occupational Outlook Handbook and provides a live interactive treemap visualization.
  2. Each occupation is scored on a 0-10 AI Exposure scale by an LLM (Gemini Flash), considering both direct automation and whether the work is inherently digital; the average exposure score is 5.3/10.
  3. The treemap displays occupations by employment size (area) and AI exposure (color, green to red), letting users explore detailed statistics and LLM rationales.

The karpathy/jobs project analyzes the susceptibility of every occupation in the US economy to AI and automation, utilizing data from the Bureau of Labor Statistics (BLS) Occupational Outlook Handbook (OOH). The study covers 342 occupations across all sectors, detailing job duties, work environment, education, pay, and employment projections.

The core methodology involves a multi-stage data pipeline:

  1. Scraping: The scrape.py script employs Playwright in a non-headless mode (to bypass BLS bot detection) to download the raw HTML for all 342 occupation pages from the BLS website. These raw HTML files are stored in the html/ directory, serving as the primary data source.
  2. Parsing: The parse_detail.py and process.py scripts utilize BeautifulSoup to convert the raw HTML content into clean Markdown files, which are then stored in the pages/ directory. This step standardizes the textual descriptions of each occupation.
  3. Tabulation: The make_csv.py script extracts structured metadata from the processed occupation data. This includes fields such as typical pay, required education level, current job count, employment growth outlook, and the Standard Occupational Classification (SOC) code. This structured data is compiled into occupations.csv.
  4. AI Exposure Scoring: This is the central analytical component, executed by score.py. Each occupation's Markdown description from the pages/ directory is sent to a Large Language Model (LLM), specifically Gemini Flash accessed via OpenRouter, requiring an OPENROUTER_API_KEY. The LLM applies a predefined scoring rubric to assign an "AI Exposure" score ranging from 0 to 10, accompanied by a textual rationale.
    • Scoring Criteria: The AI Exposure score quantifies how much AI is anticipated to reshape an occupation. It considers both:
      • Direct Automation: AI performing tasks traditionally done by humans.
      • Indirect Effects: AI increasing human productivity to such an extent that fewer workers are needed for the same output.
    • A critical signal for high exposure is whether the job's work product is fundamentally digital and can be performed entirely from a computer (e.g., from a home office). Conversely, jobs requiring physical presence, manual dexterity, or real-time human interaction are considered to have a natural barrier against AI exposure.
    • Score Calibration: The 0-10 scale is calibrated with examples:
      • 0-1 (Minimal): Roofers, Janitors, Construction Laborers.
      • 2-3 (Low): Electricians, Plumbers, Nurse Aides, Firefighters.
      • 4-5 (Moderate): Registered Nurses, Retail Workers, Physicians.
      • 6-7 (High): Teachers, Managers, Accountants, Engineers.
      • 8-9 (Very High): Software Developers, Paralegals, Data Analysts, Editors.
      • 10 (Maximum): Medical Transcriptionists.
    • The average exposure across all 342 occupations is calculated as 5.3/10. The results, including scores and rationales, are saved to scores.json.
  5. Site Data Generation: The build_site_data.py script merges the structured statistics from occupations.csv with the AI exposure scores from scores.json into a compact JSON file (site/data.json). This file is optimized for frontend consumption.
  6. Website Visualization: An interactive treemap visualization (site/index.html) is built to present the data. In this visualization, the area of each rectangle is proportional to the total employment (number of jobs) for that occupation, and the color indicates its AI exposure on a gradient from green (safe/low exposure) to red (exposed/high exposure). Occupations are grouped by BLS category, and hovering over a rectangle displays a detailed tooltip with pay, job count, growth outlook, education requirements, exposure score, and the LLM-generated rationale.
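The scoring step (step 4) can be sketched roughly as follows. This is an illustrative approximation, not the repository's actual score.py: the model slug, the rubric wording, and the JSON reply format are all assumptions, though the OpenRouter chat-completions endpoint and the 0-10 rubric anchors match what the project describes.

```python
import json
import os
import urllib.request

# Condensed stand-in for the project's scoring rubric (assumed wording).
RUBRIC = (
    "Rate this occupation's AI exposure on a 0-10 scale, where 0-1 is minimal "
    "(e.g. roofers) and 10 is maximum (e.g. medical transcriptionists). "
    'Reply with JSON: {"score": <int>, "rationale": "<one paragraph>"}'
)


def score_occupation(markdown_text: str) -> dict:
    """Send one occupation's Markdown description to Gemini Flash via OpenRouter."""
    payload = {
        "model": "google/gemini-flash-1.5",  # assumed model slug
        "messages": [
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": markdown_text},
        ],
    }
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return parse_score(body["choices"][0]["message"]["content"])


def parse_score(reply: str) -> dict:
    """Extract the {"score": ..., "rationale": ...} object from the model reply,
    tolerating extra text around the JSON."""
    start, end = reply.find("{"), reply.rfind("}") + 1
    result = json.loads(reply[start:end])
    assert 0 <= result["score"] <= 10
    return result
```

Looping this over all 342 Markdown files in pages/ and dumping the results would yield something like the project's scores.json.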
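The merge in step 5 is a straightforward join of the CSV statistics with the LLM scores. A minimal sketch, assuming illustrative column names ("name", "jobs", "pay") rather than the repository's actual schema:

```python
import csv
import io
import json


def build_site_data(csv_text: str, scores: dict) -> list[dict]:
    """Join per-occupation stats with AI-exposure scores, keyed by occupation name.

    Column names are illustrative; occupations missing from the scores dict
    simply get null exposure fields.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = scores.get(row["name"], {})
        rows.append({
            "name": row["name"],
            "jobs": int(row["jobs"]),
            "pay": row["pay"],
            "exposure": entry.get("score"),
            "rationale": entry.get("rationale"),
        })
    return rows


csv_text = "name,jobs,pay\nEditors,100000,$73000\nRoofers,150000,$47000\n"
scores = {"Editors": {"score": 9, "rationale": "fully digital work product"}}
data = build_site_data(csv_text, scores)
# json.dump(data, ...) would produce a compact site/data.json for the frontend
```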
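The treemap's green-to-red gradient (step 6) can be approximated with a linear interpolation over the 0-10 score. The endpoint colors here (pure green at 0, pure red at 10) are assumptions; the actual site may use a different palette.

```python
def exposure_color(score: float) -> str:
    """Map a 0-10 AI-exposure score to a hex color on a green-to-red gradient.

    Scores outside [0, 10] are clamped. Endpoint colors are illustrative.
    """
    t = max(0.0, min(1.0, score / 10))
    r = round(255 * t)        # red rises with exposure
    g = round(255 * (1 - t))  # green fades with exposure
    return f"#{r:02x}{g:02x}00"
```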

A key output is prompt.md, a consolidated file (approximately 45,000 tokens) designed to be pasted directly into an LLM for analysis. It packages all aggregated statistics, tier breakdowns, exposure by pay and education, BLS growth projections, and all 342 occupations with their scores and rationales, enabling data-grounded conversations about AI's labor market impact without requiring code execution. Key data files include occupations.json (master list of occupations), occupations.csv (summary stats), and scores.json (AI exposure data). The project setup requires uv for dependency management, Playwright for scraping, and an OpenRouter API key for LLM access.