The Revenge of the Data Scientist – Hamel’s Blog - Hamel Husain

Hamel Husain
2026.04.03
Web · by 이호민
#AI · #Data Science · #Evaluation · #LLM · #Machine Learning

Key Points

  1. This paper argues that the data scientist's role remains critical in the age of LLMs, as shipping effective AI still requires deep expertise in data understanding and system evaluation.
  2. It highlights five pervasive pitfalls in current AI development and evaluation practices, such as generic metrics and unverified judges, which arise from neglecting core data science fundamentals.
  3. The author asserts that "looking at the data" through methods like exploratory data analysis, rigorous experimental design, and careful labeling is indispensable for building robust and reliable AI systems.

The paper argues against the notion that the advent of large language models (LLMs) and foundation model APIs has diminished the importance of data scientists and machine learning engineers (MLEs). While these advancements simplify the deployment of AI by externalizing model training, the author contends that the critical work of setting up experiments, debugging stochastic systems, and designing effective metrics remains, and this work fundamentally relies on data science principles. The paper posits that "the harness" – the system of tests, specifications, and observability (logs, metrics, traces) that guides and constrains AI models – is largely composed of data science.
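As a concrete (and entirely hypothetical) sketch of such a harness in Python: a logged trace object stands in for the observability half, and a binary, application-specific check stands in for the tests and specifications. The `Trace` class, the `create_event` tool name, and the metric below are illustrative assumptions, not details from the post.

```python
from dataclasses import dataclass, field

# Hypothetical trace record: the observability half of the "harness".
@dataclass
class Trace:
    user_input: str
    model_output: str
    tool_calls: list = field(default_factory=list)

# A binary, application-specific check: the tests/specification half.
def calendar_event_created(trace: Trace) -> bool:
    """Pass/fail: did the assistant actually schedule the event it claims to?"""
    return any(call.get("name") == "create_event" for call in trace.tool_calls)

traces = [
    Trace("Book lunch with Sam on Friday",
          "Done! Lunch with Sam is on your calendar.",
          tool_calls=[{"name": "create_event", "args": {"title": "Lunch with Sam"}}]),
    Trace("Book lunch with Alex on Monday",
          "Done! Lunch with Alex is on your calendar.",
          tool_calls=[]),  # claims success but never called the tool
]

failure_rate = sum(not calendar_event_created(t) for t in traces) / len(traces)
print(f"Calendar Scheduling Failure rate: {failure_rate:.0%}")  # prints 50%
```

The point of the sketch is that the check inspects what the system actually did (its tool calls), not what the model says it did.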

The core of the paper details five common pitfalls in evaluating LLM-based systems, demonstrating how a data science approach resolves each:

  1. Generic Metrics:
    • Pitfall: Teams use vague, off-the-shelf metrics (e.g., "helpfulness scores," "coherence scores") that fail to diagnose specific application failures.
    • Data Scientist Approach: Data scientists explore raw data and traces, perform error analysis to categorize failures, and derive application-specific, actionable metrics (e.g., "Calendar Scheduling Failure," "Failure to Escalate To Human") based on observed breakdowns. This involves deep Exploratory Data Analysis (EDA) to understand system behavior and drive towards relevant performance indicators, rather than relying on abstract similarity metrics like ROUGE or BLEU for LLM outputs.
  2. Unverified Judges:
    • Pitfall: LLMs are frequently used as evaluative judges without verifying their trustworthiness, often by simply asking them to rate outputs on a scale.
    • Data Scientist Approach: A data scientist treats the LLM judge as a classifier, applying rigorous Model Evaluation techniques. This involves collecting human labels as ground truth, partitioning data into training, development, and test sets, and quantitatively measuring the judge's performance. The judge's prompt is iteratively optimized against a development set while holding out a separate test set to prevent overfitting. Instead of just reporting accuracy, which can hide performance issues for rare failure modes, data scientists would report precision and recall. For a binary classification task where 1 is the positive class and 0 is the negative class, precision is defined as P = TP / (TP + FP) and recall as R = TP / (TP + FN), where TP is true positives, FP is false positives, and FN is false negatives.
  3. Bad Experimental Design:
    • Pitfall: Test sets are poorly constructed, often via generic synthetic data generation, and metric rubrics are overly complex or use subjective Likert scales.
    • Data Scientist Approach: For test set construction, data scientists prioritize analyzing real production data to identify critical dimensions and edge cases before generating synthetic examples, ensuring representativeness and grounding in actual user interactions. For metric design, they reduce complexity, making each metric actionable and directly tied to business outcomes. Subjective Likert scales (e.g., 1-5) are replaced with clear, binary (pass/fail) criteria on narrowly scoped aspects of performance, thereby eliminating ambiguity and forcing clear decisions on system efficacy. This adheres to fundamental principles of Experimental Design for robust evaluation.
  4. Bad Data and Labels:
    • Pitfall: AI engineers often delegate or outsource labeling, lacking skepticism about data quality and labels, and missing the insights gained from the labeling process itself.
    • Data Scientist Approach: Data scientists maintain skepticism about data and label quality. They insist on involving domain experts in the labeling process, recognizing that labeling is not merely a task but a crucial feedback loop for "criteria drift"—where users refine their understanding of desired output quality by interacting with model outputs. The labeling process serves as an essential part of Data Collection, allowing product managers and domain experts to directly engage with raw data and iteratively define what success looks like, rather than relying solely on summary scores.
  5. Automating Too Much:
    • Pitfall: There's a temptation to fully automate evaluation processes, assuming LLMs can perform all human-intensive tasks like "looking at the data."
    • Data Scientist Approach: Data scientists recognize that while LLMs can assist with plumbing and boilerplate code, the critical human work of understanding failures and defining what to measure cannot be fully automated. The iterative process of "looking at the data"—analyzing traces, categorizing errors, and discerning patterns—is essential for discovering unforeseen issues and establishing meaningful evaluation criteria. This emphasizes that human judgment and analytical skills remain indispensable in the Production ML lifecycle for effective monitoring and improvement.
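The error-analysis workflow from the first pitfall can be sketched as a simple tally over hand-coded trace annotations. The category names echo the post's examples ("Calendar Scheduling Failure", "Failure to Escalate To Human"), but the annotations and counts themselves are invented for illustration.

```python
from collections import Counter

# Hypothetical open-coded annotations from reviewing raw traces, one per trace.
# In practice these categories emerge from EDA on real logs, not from a preset list.
annotations = [
    "calendar_scheduling_failure",
    "ok",
    "failure_to_escalate_to_human",
    "calendar_scheduling_failure",
    "ok",
    "ok",
    "calendar_scheduling_failure",
]

counts = Counter(annotations)
n = len(annotations)

# Each failure category becomes an application-specific, actionable metric.
for category, count in counts.most_common():
    if category != "ok":
        print(f"{category}: {count}/{n} = {count / n:.0%}")
```

Counting categorized failures like this is what turns "the model seems unhelpful sometimes" into a ranked list of concrete problems to fix.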
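The judge-as-classifier evaluation from the second pitfall can be sketched with made-up dev-set labels; the arithmetic simply instantiates the precision and recall definitions given there. Note how a respectable-looking accuracy can coexist with lower precision and recall on the failure class.

```python
# Hypothetical dev set: human ground-truth labels vs. LLM-judge verdicts
# (1 = failure present, 0 = no failure). The data is illustrative only;
# the judge prompt would be tuned against this set, with a held-out test set untouched.
human_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
judge_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]

tp = sum(h == 1 and j == 1 for h, j in zip(human_labels, judge_labels))  # true positives
fp = sum(h == 0 and j == 1 for h, j in zip(human_labels, judge_labels))  # false positives
fn = sum(h == 1 and j == 0 for h, j in zip(human_labels, judge_labels))  # false negatives

precision = tp / (tp + fp)  # P = TP / (TP + FP)
recall = tp / (tp + fn)     # R = TP / (TP + FN)
accuracy = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints accuracy=0.80 precision=0.75 recall=0.75
```

Here accuracy alone (0.80) would overstate how well the judge catches failures: it both misses a real failure (a false negative) and flags a non-failure (a false positive).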

The paper concludes that these pitfalls all stem from a lack of foundational data science skills, such as Exploratory Data Analysis, Model Evaluation, Experimental Design, Data Collection, and Production ML. The work itself, the author claims, has not changed; only the terminology has. Therefore, data scientists, with their inherent skepticism and rigorous methodological training, are more critical than ever in the age of LLMs, especially in building and refining the "harness" that ensures AI systems perform reliably and effectively.