
Advanced Data Analysis from an Elementary Point of View
Key Points
- This document describes "Advanced Data Analysis from an Elementary Point of View," a draft textbook intended for advanced undergraduates with prerequisites in probability, statistics, and linear regression.
- The book's extensive content covers Regression and Its Generalizations, Distributions and Latent Structure, Causal Inference, and Dependent Data, along with several online-only appendices.
- Originating from Carnegie Mellon University lecture notes, the evolving draft is under contract with Cambridge University Press and remains freely accessible online.
This document describes a draft textbook, "Advanced Data Analysis from an Elementary Point of View," by Cosma Rohilla Shalizi. Intended for advanced undergraduates who have completed courses in probability, mathematical statistics, and linear regression, the book grew out of lecture notes for course 36-402 at Carnegie Mellon University, and it aims to give a comprehensive treatment of modern data-analysis methods from an accessible starting point. The current draft is available as a PDF, with links to directories of R code for the examples and the corresponding data sets. The book is under contract with Cambridge University Press, with a commitment to keep a "next-to-final" version freely accessible online.
The textbook is organized into four main parts, plus online-only appendices, covering a wide range of statistical and machine-learning techniques:
I. Regression and Its Generalizations: This part opens with fundamental regression concepts ("Regression Basics") and then critically examines the properties and assumptions of linear regression ("The Truth about Linear Regression"). "Model Evaluation" covers methods for assessing model performance, including cross-validation and residual analysis. Flexible non-parametric and semi-parametric approaches follow: "Smoothing in Regression" (e.g., kernel regression, local polynomial regression) and "Splines" (e.g., B-splines, natural splines, smoothing splines) for capturing non-linear relationships. "Additive Models" and "Generalized Additive Models" extend this framework, modeling the response as a sum of smooth functions of the predictors, often fit with backfitting algorithms. "The Bootstrap" introduces resampling-based inference that avoids strong distributional assumptions. "Testing Regression Specifications" addresses validating model assumptions, and "Weighting and Variance" covers heteroskedasticity and differential data importance. "Logistic Regression" treats binary outcomes as a special case of the "Generalized Linear Models" (GLMs) introduced more broadly alongside it. Finally, "Classification and Regression Trees" (CART) explores decision-tree methods for both classification and regression.
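To make the smoothing idea concrete, here is a minimal sketch of Nadaraya-Watson kernel regression in Python/NumPy. This is an illustration of the general technique, not code from the book (whose examples are in R); the function name, the sine test function, and the bandwidth value are all invented for this sketch.

```python
import numpy as np

def kernel_smoother(x_train, y_train, x_query, bandwidth):
    """Nadaraya-Watson kernel regression with a Gaussian kernel.

    Each prediction is a weighted average of the training responses,
    with weights that decay smoothly with distance / bandwidth.
    """
    # Scaled pairwise distances between query and training points
    d = (x_query[:, None] - x_train[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)            # Gaussian kernel weights
    w /= w.sum(axis=1, keepdims=True)    # normalize each row to sum to 1
    return w @ y_train

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)  # noisy sine curve
y_hat = kernel_smoother(x, y, x, bandwidth=0.3)     # smooth fit, no linearity assumed
```

In practice the bandwidth would be chosen by cross-validation rather than fixed by hand, which is exactly the kind of model-evaluation question the part's chapters take up.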
II. Distributions and Latent Structure: This part focuses on understanding data distributions and uncovering underlying structures. It includes "Density Estimation" techniques (e.g., kernel density estimation) for estimating probability density functions non-parametrically. "Principal Components Analysis" (PCA) is presented as a method for dimensionality reduction and identifying orthogonal linear combinations of variables that capture maximum variance. "Factor Models" are discussed for explaining observed correlations among variables through a smaller set of unobserved (latent) factors. "Mixture Models" (e.g., Gaussian Mixture Models) provide a framework for modeling populations composed of several distinct subpopulations, often estimated using Expectation-Maximization (EM) algorithms. "Graphical Models" introduce a way to represent conditional dependencies between variables using graphs, including Bayesian networks and Markov random fields.
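As a sketch of the dimensionality-reduction idea behind PCA (again illustrative Python, not the book's R code), the example below simulates three observed variables driven by one latent factor and recovers the dominant direction from the singular value decomposition of the centered data; the loadings and noise scale are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulate 3-D data that mostly varies along one latent direction
latent = rng.normal(size=(500, 1))
loadings = np.array([[2.0, 1.0, 0.5]])
X = latent @ loadings + rng.normal(scale=0.1, size=(500, 3))

Xc = X - X.mean(axis=0)                   # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)       # share of variance per component
scores = Xc @ Vt[0]                       # projections onto the first PC
```

Because one latent factor generates almost all the variation, the first principal component captures nearly all the variance, which is the pattern PCA and factor models are designed to exploit.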
III. Causal Inference: This dedicated section covers the critical area of inferring causal relationships from observational data. It begins with "Graphical Causal Models" (e.g., Directed Acyclic Graphs or DAGs) for representing causal assumptions. Techniques for "Identifying Causal Effects" are discussed, including concepts like d-separation, backdoor criterion, and front-door criterion. Methods for "Estimating Causal Effects" are presented, such as instrumental variables, matching, propensity score methods, and regression discontinuity. The final topic, "Discovering Causal Structure," addresses algorithms for learning causal graphs from data.
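The backdoor idea can be illustrated with a small simulation (a hedged Python sketch under invented coefficients, not an example from the book): a confounder Z drives both treatment X and outcome Y, so the naive regression of Y on X is biased, while adjusting for Z recovers the true causal coefficient.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
z = rng.normal(size=n)                       # observed confounder
x = 1.5 * z + rng.normal(size=n)             # treatment depends on z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)   # true causal effect of x is 2.0

# Naive slope of y on x alone: confounded, overstates the effect
naive = np.polyfit(x, y, 1)[0]

# Backdoor adjustment: z blocks the backdoor path, so include it as a regressor
A = np.column_stack([x, z, np.ones(n)])
adjusted = np.linalg.lstsq(A, y, rcond=None)[0][0]
```

Regression adjustment is only one of the estimation strategies the part covers; matching and propensity-score methods implement the same adjustment idea without assuming a linear outcome model.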
IV. Dependent Data: This part turns to data with temporal or spatial dependence. "Time Series" covers models for data collected sequentially over time (e.g., ARMA and ARIMA models, state-space models), and "Simulation-Based Inference" explores the use of computational simulation for inference in complex statistical problems.
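A minimal sketch of time-series dependence (illustrative Python, with the coefficient and sample size invented for the example): simulate an AR(1) process, the simplest ARMA model, and estimate its autoregressive coefficient by least squares, i.e., by regressing each observation on its predecessor.

```python
import numpy as np

rng = np.random.default_rng(3)
phi, n = 0.7, 20_000
y = np.zeros(n)
for t in range(1, n):
    # AR(1): each value is a fraction of the previous one plus fresh noise
    y[t] = phi * y[t - 1] + rng.normal()

# Least-squares estimate of phi: regress y_t on y_{t-1} (no intercept needed,
# since the process has mean zero by construction)
phi_hat = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])
```

The same simulate-then-fit pattern scales up to the simulation-based inference the part discusses, where models too complicated for closed-form likelihoods are fit by comparing simulated output to data.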
Online-only Appendices: Supplementary topics include "Big O and Little o Notation" for asymptotic analysis, "Taylor Expansions" for approximating functions, "Propagation of Error, and Standard Errors for Derived Quantities" for uncertainty quantification, "Optimization" algorithms, "Relative Distributions and Smooth Tests of Goodness of Fit," "Nonlinear Dimensionality Reduction" techniques, "Rudimentary Graph Theory" (relevant for graphical models), methods for handling "Missing Data" (e.g., imputation), and guidance on "Writing R Functions."
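The propagation-of-error appendix rests on the delta method: for a smooth function g, the standard error of g(X) is approximately |g'(mu)| times the standard error of X. The sketch below (illustrative Python with invented values of mu and sigma) checks this first-order approximation against a Monte Carlo estimate for g = log.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 10.0, 0.5

# Delta method: se of g(X) ~= |g'(mu)| * sigma; here g = log, so g'(mu) = 1/mu
delta_se = sigma / mu

# Monte Carlo check: simulate X, transform, and measure the spread directly
x = rng.normal(mu, sigma, size=200_000)
mc_se = float(np.log(x).std())
```

The close agreement reflects the Taylor-expansion reasoning of the appendices: the approximation is good exactly when sigma is small relative to the curvature of g near mu.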
Planned changes include unifying information theory treatment, refining nonparametric instrumental variables, streamlining the time series chapter, adding an appendix on heuristic essential asymptotics, standardizing notation, and ensuring consistency across the manuscript.