Mathematical Statistics
Estimation, hypothesis testing, Bayesian inference, and high-dimensional statistics.
Mathematical statistics is the discipline that transforms probability theory into a rigorous framework for learning from data. Where probability theory asks “given a model, what outcomes should we expect?”, statistics inverts the question: “given outcomes we have observed, what can we conclude about the model?” This inversion — from data to inference — is at once a simple idea and an endlessly deep one, carrying consequences that touch every quantitative science.
Foundations of Statistical Inference
The enterprise of statistical inference begins with a statistical model: a family of probability distributions $\{P_\theta : \theta \in \Theta\}$ indexed by an unknown parameter $\theta$ that ranges over a parameter space $\Theta$. We observe data $X_1, \dots, X_n$, assumed to be drawn from some particular $P_{\theta_0}$ — the true but unknown distribution — and our goal is to say something useful about $\theta_0$. The model is parametric when $\Theta \subseteq \mathbb{R}^d$ for some finite $d$, and nonparametric when $\Theta$ is an infinite-dimensional space such as the collection of all continuous densities.
The likelihood function is the central object linking data to parameters. Given observations $X_1, \dots, X_n$ from a model with density or mass function $f(x; \theta)$, the likelihood is

$$L(\theta) = \prod_{i=1}^n f(X_i; \theta),$$

and the log-likelihood is $\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(X_i; \theta)$. Crucially, the likelihood is viewed as a function of $\theta$ for fixed data — not as a probability over $\theta$. It measures how compatible each parameter value is with what we actually observed. Ronald Aylmer Fisher, in a series of papers beginning in 1912 and consolidated in his 1922 paper “On the Mathematical Foundations of Theoretical Statistics,” essentially single-handedly constructed the modern framework: he introduced the likelihood function, maximum likelihood estimation, sufficient statistics, and the concept of information, all within a few years of revolutionary output.
A sufficient statistic $T(X)$ captures everything the sample has to say about $\theta$: formally, the conditional distribution of the data given $T(X)$ does not depend on $\theta$. The Fisher-Neyman factorization theorem characterizes sufficiency elegantly: $T$ is sufficient if and only if the likelihood factors as $L(\theta) = g(T(x); \theta)\, h(x)$, where $h$ does not depend on $\theta$. For the normal model with known variance, the sample mean is sufficient for the mean parameter; for the Poisson, the sample sum is sufficient for the rate. The exponential family — distributions of the form $f(x; \theta) = h(x) \exp\{\eta(\theta)^\top T(x) - A(\theta)\}$ — is the natural setting for sufficiency, as $T(x)$ is always sufficient and the family has elegant mathematical properties that recur throughout statistics.
Completeness adds a further regularity: a sufficient statistic $T$ is complete if the only function $g$ satisfying $E_\theta[g(T)] = 0$ for all $\theta \in \Theta$ is $g = 0$ almost everywhere. Completeness, combined with sufficiency, enables uniqueness results in estimation theory. The concept is due to Erich Leo Lehmann and Henry Scheffé, whose 1950 and 1955 papers gave it its modern form.
Point Estimation and Maximum Likelihood
A point estimator $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$ is any statistic used to produce a single value as a guess for the unknown $\theta$. The simplest quality criterion is bias: the bias of $\hat{\theta}$ is $\operatorname{Bias}_\theta(\hat{\theta}) = E_\theta[\hat{\theta}] - \theta$. An estimator with zero bias is called unbiased. Unbiasedness is a natural desideratum, but it is not sufficient on its own — among all unbiased estimators, we prefer those with smaller variance. The gold standard is the Uniformly Minimum Variance Unbiased Estimator (UMVUE): the unique unbiased estimator that achieves minimum variance simultaneously for every value of $\theta$.
The Rao-Blackwell theorem provides a constructive route to UMVUEs. It states that if $\hat{\theta}$ is any unbiased estimator and $T$ is a sufficient statistic, then the conditional expectation $\tilde{\theta} = E[\hat{\theta} \mid T]$ is also unbiased and has variance no greater than that of $\hat{\theta}$ at every $\theta$. Conditioning on a sufficient statistic can only improve an estimator. The Lehmann-Scheffé theorem completes the picture: if $T$ is a complete sufficient statistic and $\hat{\theta} = g(T)$ is unbiased for $\theta$, then $\hat{\theta}$ is the unique UMVUE.
The maximum likelihood estimator (MLE) maximizes the likelihood over $\Theta$:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta \in \Theta} L(\theta) = \arg\max_{\theta \in \Theta} \ell(\theta).$$
The MLE may not be unbiased — the MLE $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$ for the variance of a normal distribution is biased downward — but it enjoys a remarkable collection of large-sample properties under mild regularity conditions: consistency, asymptotic normality, and asymptotic efficiency. The MLE is also equivariant under reparametrization: if $\psi = g(\theta)$ and $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $\psi$. This invariance is a practically important feature absent from the UMVUE framework.
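A quick simulation makes the downward bias of the normal-variance MLE concrete. This is an illustrative sketch in Python with NumPy; the sample size, variance, and replication count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 4.0, 20000

# Average the MLE of the variance over many simulated samples:
# sigma2_mle = (1/n) * sum (X_i - Xbar)^2 is biased downward
# by the factor (n - 1)/n.
mle_estimates = np.empty(reps)
for r in range(reps):
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    mle_estimates[r] = np.mean((x - x.mean()) ** 2)

print(np.mean(mle_estimates))       # close to sigma2 * (n-1)/n = 3.6
print(sigma2 * (n - 1) / n)
```

The average of the MLE across replications sits near $\sigma^2 (n-1)/n$ rather than $\sigma^2$ — exactly the factor that the usual $n - 1$ denominator corrects.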
The method of moments is an older and simpler approach, attributed in systematic form to Karl Pearson around 1894: equate the first $k$ population moments $E_\theta[X^j]$, $j = 1, \dots, k$, to the corresponding sample moments $\frac{1}{n}\sum_{i=1}^n X_i^j$, and solve for $\theta$. Method of moments estimators are typically consistent and asymptotically normal, but rarely efficient. Their main virtue is computational: they often yield explicit closed-form formulas when the MLE equations are analytically intractable.
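As an illustration of that closed-form convenience, here is a method-of-moments fit for a Gamma distribution (the shape and scale values are arbitrary choices for this sketch). Matching $E[X] = k\theta$ and $\operatorname{Var}(X) = k\theta^2$ gives explicit estimators, with no numerical optimization:

```python
import numpy as np

rng = np.random.default_rng(1)
shape_true, scale_true = 3.0, 2.0
x = rng.gamma(shape_true, scale_true, size=100000)

# Match first two moments: mean = k * theta, variance = k * theta^2,
# then solve the two equations for (k, theta) in closed form.
m1 = x.mean()
m2 = ((x - m1) ** 2).mean()
shape_mm = m1 ** 2 / m2
scale_mm = m2 / m1

print(shape_mm, scale_mm)   # close to (3.0, 2.0)
```

The gamma MLE, by contrast, requires solving a transcendental equation involving the digamma function.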
Cramér-Rao Theory and Fisher Information
The score function is the gradient of the log-likelihood with respect to $\theta$:

$$s(\theta; x) = \nabla_\theta \log f(x; \theta).$$

Under regularity conditions (primarily, that differentiation and integration can be exchanged), the expected score is zero: $E_\theta[s(\theta; X)] = 0$. The Fisher information is the variance of the score:

$$I(\theta) = \operatorname{Var}_\theta\bigl(s(\theta; X)\bigr) = E_\theta\bigl[s(\theta; X)\, s(\theta; X)^\top\bigr].$$
Fisher information measures how sharply the log-likelihood peaks around its maximum — equivalently, how much information a single observation carries about $\theta$. For $n$ independent and identically distributed observations, the total Fisher information is simply $n I(\theta)$.
The Cramér-Rao lower bound (CRLB) is the fundamental limitation on estimation precision. It states that for any unbiased estimator $\hat{\theta}$ based on $n$ i.i.d. observations,

$$\operatorname{Var}_\theta(\hat{\theta}) \ge \frac{1}{n I(\theta)}.$$

The bound was discovered independently by Harald Cramér and C. R. Rao in 1945-1946. The proof is a beautiful application of the Cauchy-Schwarz inequality. In the multiparameter case, the bound generalizes: for an unbiased estimator $\hat{\theta}$ of $\theta \in \mathbb{R}^d$, the covariance matrix satisfies $\operatorname{Cov}_\theta(\hat{\theta}) \succeq I(\theta)^{-1}$, where $I(\theta)$ is now the Fisher information matrix and $\succeq$ denotes the positive semidefinite ordering.
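The bound can be checked by simulation in a model where it is attained exactly. For Bernoulli($p$), $I(p) = 1/(p(1-p))$ and the MLE is the sample mean, so its variance should equal $p(1-p)/n$; this sketch (with arbitrary $p$ and $n$) verifies that numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.3, 50, 40000

# MLE of a Bernoulli probability is the sample mean; its variance
# should match the Cramer-Rao bound 1/(n*I(p)) = p(1-p)/n exactly,
# since the Bernoulli model is a one-parameter exponential family.
phat = rng.binomial(n, p, size=reps) / n
empirical_var = phat.var()
crlb = p * (1 - p) / n

print(empirical_var, crlb)   # both close to 0.0042
```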
An estimator that achieves the Cramér-Rao bound for all $\theta$ is called efficient. Such estimators exist if and only if the score function has the form $s(\theta; x) = a(\theta)\,[\hat{\theta}(x) - \theta]$ for some function $a(\theta)$ — a condition met exactly within the exponential family. When the bound is not achieved for finite $n$, it is often achieved asymptotically: the MLE satisfies

$$\sqrt{n}\,\bigl(\hat{\theta}_{\mathrm{MLE}} - \theta_0\bigr) \xrightarrow{d} N\bigl(0,\; I(\theta_0)^{-1}\bigr),$$

making it asymptotically efficient — no regular estimator can do better in large samples. This asymptotic optimality result, proved rigorously by Lucien Le Cam through the theory of Local Asymptotic Normality (LAN), is one of the central achievements of twentieth-century statistics.
Hypothesis Testing and Confidence Intervals
In hypothesis testing, we are asked to choose between a null hypothesis $H_0: \theta \in \Theta_0$ and an alternative $H_1: \theta \in \Theta_1$. A test is a rule for deciding which hypothesis to accept, typically encoded by a rejection region $R$: we reject $H_0$ when the data falls in $R$. There are two kinds of errors: a Type I error (rejecting $H_0$ when it is true) occurs with probability $\alpha$, and a Type II error (failing to reject $H_0$ when $H_1$ is true) occurs with probability $\beta$. The power of a test at a specific $\theta \in \Theta_1$ is $1 - \beta(\theta)$. The goal is to control Type I error at a specified level $\alpha$ while maximizing power.
The cornerstone of classical hypothesis testing is the Neyman-Pearson lemma (1933). For a simple null $H_0: \theta = \theta_0$ against a simple alternative $H_1: \theta = \theta_1$, the most powerful test at level $\alpha$ rejects when the likelihood ratio exceeds a threshold:

$$\Lambda(x) = \frac{L(\theta_1; x)}{L(\theta_0; x)} > k, \qquad \text{with } k \text{ chosen so that } P_{\theta_0}\bigl(\Lambda(X) > k\bigr) = \alpha.$$

This result is exactly optimal: no other level-$\alpha$ test has higher power against $\theta_1$. When the alternative is composite — a range of parameter values rather than a single point — the story is more subtle. A uniformly most powerful (UMP) test dominates all competitors simultaneously for every $\theta \in \Theta_1$. UMP tests exist in one-parameter exponential families for one-sided hypotheses; for two-sided alternatives, they generally do not exist, and one must settle for unbiased or invariant tests.
The likelihood ratio test (LRT) extends naturally to composite hypotheses. Its statistic is

$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta; x)}{\sup_{\theta \in \Theta} L(\theta; x)} = \frac{L(\hat{\theta}_0; x)}{L(\hat{\theta}; x)},$$

where $\hat{\theta}_0$ is the restricted MLE under $H_0$. Wilks’ theorem (1938) states that under $H_0$ and regularity conditions, $-2 \log \lambda \xrightarrow{d} \chi^2_r$, where $r$ is the number of constraints imposed by the null. This provides asymptotic critical values without requiring knowledge of the exact null distribution. The Wald test and score test (Rao test) are asymptotically equivalent to the LRT under the null, forming a classical trinity of large-sample tests.
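Wilks’ theorem is easy to see numerically. The sketch below simulates the LRT statistic for $H_0: \lambda = \lambda_0$ in a Poisson model, where the statistic has the closed form $2n\,[\bar{X}\log(\bar{X}/\lambda_0) - (\bar{X} - \lambda_0)]$; under the null it should behave like $\chi^2_1$ (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
lam0, n, reps = 5.0, 200, 20000

# Simulate the LRT statistic for H0: lambda = lam0 in a Poisson model.
# Under H0, Wilks' theorem says -2 log(lambda) ~ chi^2 with 1 df,
# whose mean is 1 and whose 95th percentile is about 3.84.
xbar = rng.poisson(lam0, size=(reps, n)).mean(axis=1)
lrt = 2 * n * (xbar * np.log(xbar / lam0) - (xbar - lam0))

print(lrt.mean())              # close to 1
print(np.mean(lrt > 3.841))    # close to 0.05
```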
A confidence interval at level $1 - \alpha$ is a data-dependent interval $[L(X), U(X)]$ satisfying $P_\theta\bigl(L(X) \le \theta \le U(X)\bigr) \ge 1 - \alpha$ for all $\theta$. There is a deep duality between hypothesis tests and confidence intervals: the set of parameter values not rejected by a level-$\alpha$ test constitutes a $1 - \alpha$ confidence set. A standard construction uses a pivotal quantity — a function of the data and parameter whose distribution is free of $\theta$. For the normal mean with known variance, $\sqrt{n}(\bar{X} - \mu)/\sigma$ is standard normal, yielding the familiar interval $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$.
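The frequentist guarantee is a statement about coverage across repeated samples, which a simulation shows directly (illustrative parameter values; $z \approx 1.96$ is the 97.5% normal quantile, hardcoded to avoid a dependency):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 10.0, 2.0, 25, 50000
z = 1.959964   # 97.5% quantile of the standard normal

# Coverage of the pivotal interval xbar +/- z * sigma / sqrt(n):
# the fraction of intervals containing mu should be about 0.95.
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
half = z * sigma / np.sqrt(n)
coverage = np.mean((xbar - half <= mu) & (mu <= xbar + half))

print(coverage)   # close to 0.95
```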
Bayesian Inference
The Bayesian approach to statistics begins with a prior distribution $\pi(\theta)$ that encodes beliefs about $\theta$ before observing data. Upon observing $x$, Bayes’ theorem yields the posterior distribution:

$$\pi(\theta \mid x) = \frac{f(x; \theta)\,\pi(\theta)}{\int_\Theta f(x; \theta')\,\pi(\theta')\,d\theta'}.$$
The posterior combines prior belief and data evidence, and all Bayesian inference flows from it. A point estimate might be the posterior mean (optimal under squared error loss), the posterior median (optimal under absolute error loss), or the posterior mode (the MAP estimate). A credible interval $C$ with $P(\theta \in C \mid x) = 1 - \alpha$ provides an interval estimate with a direct probability interpretation that frequentist confidence intervals lack.
The Bayesian framework was articulated philosophically by Thomas Bayes in his posthumous 1763 essay, formalized mathematically by Pierre-Simon Laplace in his Théorie analytique des probabilités (1812), and given its modern decision-theoretic foundation by Leonard Jimmie Savage in The Foundations of Statistics (1954) and Bruno de Finetti in his theory of subjective probability. For centuries, Bayesian and frequentist approaches were competing philosophical camps; modern statistics increasingly views them as complementary tools.
A conjugate prior is one that yields a posterior in the same parametric family. The beta-binomial pair is the canonical example: if $\theta \sim \mathrm{Beta}(a, b)$ and $X \mid \theta \sim \mathrm{Binomial}(n, \theta)$, then

$$\theta \mid X = x \;\sim\; \mathrm{Beta}(a + x,\; b + n - x).$$

The prior parameters $a$ and $b$ act as pseudo-counts from imaginary previous observations, and updating simply adds real observations to them. Conjugate pairs exist for every exponential family model — a consequence of the algebraic structure of these families. When no conjugate prior is available, Markov Chain Monte Carlo (MCMC) methods — Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo — allow approximate posterior computation for arbitrarily complex models.
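The conjugate update is a one-line computation. This sketch uses arbitrary prior pseudo-counts and made-up data:

```python
# Conjugate beta-binomial update: Beta(a, b) prior, x successes in
# n trials, posterior Beta(a + x, b + n - x) with a closed-form mean.
a, b = 2.0, 2.0     # prior pseudo-counts (arbitrary)
n, x = 20, 14       # observed data: 14 successes in 20 trials

a_post, b_post = a + x, b + n - x
posterior_mean = a_post / (a_post + b_post)
mle = x / n

print(a_post, b_post)     # 16.0 8.0
print(posterior_mean)     # 16/24 = 0.666...
print(mle)                # 0.7
```

The posterior mean $16/24 \approx 0.667$ sits between the MLE $0.7$ and the prior mean $0.5$, and the prior’s pull shrinks as $n$ grows.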
The choice of prior is the most contested aspect of Bayesian inference. Jeffreys priors are non-informative priors derived from the Fisher information: $\pi_J(\theta) \propto \sqrt{\det I(\theta)}$. They are invariant under reparametrization — a crucial property ensuring that the “uninformative” label is consistent regardless of how the problem is framed. Bernardo’s reference priors generalize this idea to maximize the expected Kullback-Leibler divergence between prior and posterior, formalizing the notion that the data should contribute maximally to inference.
The Bayes factor $BF_{10} = p(x \mid M_1)/p(x \mid M_0)$ compares two models $M_0$ and $M_1$ by their marginal likelihoods. A Bayes factor greater than 1 favors $M_1$; values exceeding 10 or 100 are often described as strong or decisive evidence. The Bayes factor automatically penalizes more complex models — the so-called Occam factor — without requiring an explicit penalty term, making it a principled approach to model selection. Its asymptotic behavior is captured by the Bayesian Information Criterion (BIC): $\mathrm{BIC} = -2\,\ell(\hat{\theta}) + k \log n$, where $k$ is the number of parameters.
Decision Theory and Admissibility
Statistical decision theory, developed largely by Abraham Wald in his 1950 monograph Statistical Decision Functions, provides a unified framework for all of statistics. A decision problem consists of a parameter space $\Theta$, an action space $\mathcal{A}$, and a loss function $L(\theta, a)$ measuring the cost of taking action $a$ when the truth is $\theta$. The risk of a decision rule $\delta$ is its expected loss: $R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]$.
A rule $\delta'$ dominates $\delta$ if $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta$, with strict inequality for at least one $\theta$. A rule is admissible if no rule dominates it. Admissibility is a minimal sanity condition: an inadmissible rule can be improved without sacrificing performance anywhere. Remarkably, every admissible rule (under mild conditions) is either a Bayes rule for some prior or a limit of Bayes rules — this is the complete class theorem, which gives Bayesian inference a frequentist justification: Bayesian procedures are the only ones that can be admissible.
The most celebrated result in decision theory is the James-Stein phenomenon. In 1961, Willard James and Charles Stein demonstrated that for dimensions $p \ge 3$, the sample mean (the natural, unbiased, MLE estimator of the mean of a multivariate normal) is inadmissible under squared error loss. For $X \sim N(\theta, I_p)$, the James-Stein estimator

$$\hat{\theta}_{JS} = \left(1 - \frac{p - 2}{\|X\|^2}\right) X$$

uniformly dominates $X$ as an estimator of the mean vector $\theta$. This result, which helped reshape statistical thinking, says that shrinking all coordinates simultaneously toward zero — even if they are measuring unrelated quantities — yields a better estimator. The effect is counterintuitive but mathematically inescapable: “borrowing strength” across coordinates pays dividends.
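The dominance is easy to witness by Monte Carlo. This sketch fixes an arbitrary mean vector in $\mathbb{R}^{10}$ and compares the empirical risks of $X$ and the James-Stein estimator under $X \sim N(\theta, I)$:

```python
import numpy as np

rng = np.random.default_rng(5)
p, reps = 10, 20000

# Compare squared-error risk of X (the MLE) and the James-Stein
# estimator of the mean of a N(theta, I) vector, p >= 3.
theta = rng.normal(0.0, 1.0, size=p)    # an arbitrary fixed mean vector
X = theta + rng.normal(size=(reps, p))

norms2 = np.sum(X ** 2, axis=1)
js = (1 - (p - 2) / norms2)[:, None] * X

risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))   # approximately p
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))

print(risk_mle, risk_js)   # risk_js is strictly smaller
```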
A minimax decision rule minimizes the worst-case risk: $\delta^* = \arg\min_\delta \sup_{\theta \in \Theta} R(\theta, \delta)$. Minimax rules protect against the worst case; Bayes rules minimize the Bayes risk $r(\pi, \delta) = \int R(\theta, \delta)\,\pi(\theta)\,d\theta$ for a given prior $\pi$. When a Bayes rule has constant risk over $\Theta$, it is simultaneously minimax — a powerful connection between the two paradigms. The least favorable prior — the prior that makes the inference problem hardest — is the prior for which the Bayes risk equals the minimax risk.
Asymptotic Theory and Nonparametric Methods
Large-sample theory makes statistical inference tractable. The three workhorses are the Law of Large Numbers (consistency of sample averages), the Central Limit Theorem (asymptotic normality of sums), and the delta method (asymptotic normality of smooth transformations). If $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \ne 0$, then

$$\sqrt{n}\,\bigl(g(\hat{\theta}_n) - g(\theta)\bigr) \xrightarrow{d} N\bigl(0,\; g'(\theta)^2 \sigma^2\bigr).$$
This simple observation — that smooth functions of asymptotically normal statistics are asymptotically normal — underpins an enormous portion of applied statistical practice.
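As a concrete check, take $g(x) = x^2$ applied to a normal sample mean: the delta method predicts asymptotic variance $(2\mu)^2\sigma^2$ (the parameter values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, reps = 3.0, 1.5, 400, 20000

# Delta method with g(x) = x^2: sqrt(n) * (xbar^2 - mu^2) should be
# approximately N(0, g'(mu)^2 * sigma^2) = N(0, (2*mu*sigma)^2).
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
scaled = np.sqrt(n) * (xbar ** 2 - mu ** 2)

print(scaled.var())              # close to (2*mu*sigma)^2 = 81
print((2 * mu * sigma) ** 2)
```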
Consistency of the MLE under Cramér’s regularity conditions follows from the uniform law of large numbers applied to the average log-likelihood. Under stronger regularity (Wald’s conditions), the MLE is also asymptotically normal with covariance matrix equal to the inverse Fisher information. The theory was made rigorous by Cramér (1946), Wald (1949), and substantially generalized by Le Cam (1953, 1960) through his theory of convergence of experiments, which shows that the MLE achieves the asymptotic minimax lower bound among all estimators.
Nonparametric statistics abandons parametric model assumptions. The empirical distribution function $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le x\}$ is the canonical nonparametric estimator of the CDF. The Glivenko-Cantelli theorem guarantees $\sup_x |\hat{F}_n(x) - F(x)| \to 0$ almost surely. At a finer scale, the process $\sqrt{n}(\hat{F}_n - F)$ converges in distribution to a Brownian bridge — the content of Donsker’s theorem (1952), which was the first major result in empirical process theory and remains the engine behind much of modern nonparametric asymptotics.
Kernel density estimation provides a smooth estimate of the unknown density $f$. Given a kernel $K$ (a symmetric density) and bandwidth $h > 0$, the estimator is

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right).$$

The mean integrated squared error balances a squared bias term of order $h^4$ and a variance term of order $1/(nh)$, yielding an optimal bandwidth $h \asymp n^{-1/5}$ and minimax rate $n^{-4/5}$ — slower than the parametric rate $n^{-1}$, reflecting the cost of not knowing $f$’s shape. The theory of such minimax rates — establishing matching upper and lower bounds for estimation in function classes — was developed by Farrell (1972), Stone (1980, 1982), and systematized in the minimax framework of Ibragimov and Khasminskii (1981).
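The estimator itself is a few lines of NumPy. This sketch uses a Gaussian kernel and a rule-of-thumb bandwidth proportional to $n^{-1/5}$ (as the MISE trade-off suggests), then compares against the true standard normal density:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
x = rng.normal(0.0, 1.0, size=n)

# Gaussian-kernel density estimate on a grid, with a rule-of-thumb
# bandwidth h proportional to n^(-1/5) from the MISE trade-off.
h = 1.06 * x.std() * n ** (-1 / 5)
grid = np.linspace(-3, 3, 61)

diffs = (grid[:, None] - x[None, :]) / h
fhat = np.exp(-0.5 * diffs ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

true_density = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(fhat - true_density)))   # small sup-norm error
```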
Rank-based tests are entirely nonparametric: the Wilcoxon signed-rank test, the Mann-Whitney U test, and the Kruskal-Wallis test replace raw data with ranks, rendering their null distributions free of any distributional assumption. The permutation test is even more fundamental: by computing the test statistic over all possible reassignments of treatment labels, it produces an exact finite-sample $p$-value under the null hypothesis of exchangeability.
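For two small groups, the permutation test can be carried out exactly by enumerating every label reassignment (the data here are made up for illustration):

```python
import numpy as np
from itertools import combinations

# Exact permutation test for a difference in means between two small
# groups: enumerate all C(8, 4) = 70 reassignments of the labels.
group_a = np.array([4.8, 5.2, 6.1, 5.9])
group_b = np.array([3.1, 3.5, 2.9, 4.0])
pooled = np.concatenate([group_a, group_b])
observed = group_a.mean() - group_b.mean()

count = 0
total = 0
for idx in combinations(range(8), 4):
    mask = np.zeros(8, dtype=bool)
    mask[list(idx)] = True
    stat = pooled[mask].mean() - pooled[~mask].mean()
    if stat >= observed:
        count += 1
    total += 1

p_value = count / total    # exact one-sided p-value
print(total, p_value)      # 70 reassignments
```

Here only the observed split attains the observed difference, giving an exact one-sided $p$-value of $1/70 \approx 0.014$.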
High-Dimensional Statistics
Classical asymptotic theory assumes the dimension $p$ is fixed as $n \to \infty$. Modern data challenges this: genomics, imaging, natural language processing, and finance routinely produce datasets where the number of parameters $p$ grows with — or even vastly exceeds — the sample size $n$. High-dimensional statistics is the rigorous study of inference when $p \asymp n$ or $p \gg n$.
In the high-dimensional linear regression problem $y = X\beta + \varepsilon$ with $p > n$, the ordinary least squares estimator does not even exist (the system is underdetermined). The Lasso, introduced by Robert Tibshirani in 1996, imposes an $\ell_1$ penalty:

$$\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}.$$

The $\ell_1$ penalty promotes sparsity — setting many coefficients exactly to zero — making it simultaneously a regularizer and a variable selector. Under restricted eigenvalue conditions on $X$ and when the true $\beta$ is $s$-sparse, the Lasso achieves the oracle rate $\|\hat{\beta} - \beta\|_2^2 \lesssim s \log(p)/n$, a result due to Bickel, Ritov, and Tsybakov (2009). This rate is minimax optimal up to logarithmic factors.
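The Lasso can be solved with proximal gradient descent (ISTA), whose proximal step is coordinatewise soft-thresholding. This sketch minimizes the unnormalized objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ (the $1/(2n)$ scaling only rescales $\lambda$), using an arbitrary sparse truth and a heuristic penalty level:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, s = 100, 200, 5

# Sparse regression with p > n: recover a 5-sparse coefficient vector
# with the Lasso, solved by ISTA (gradient step + soft-thresholding).
X = rng.normal(size=(n, p)) / np.sqrt(n)   # roughly unit-norm columns
beta_true = np.zeros(p)
beta_true[:s] = 3.0
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 0.3    # heuristic: about sigma * sqrt(2 log p) with sigma = 0.1
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / largest eigenvalue of X^T X

beta = np.zeros(p)
for _ in range(2000):
    grad = X.T @ (X @ beta - y)
    z = beta - step * grad
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print(np.sum(beta != 0))                 # few nonzeros, near the true support
print(np.linalg.norm(beta - beta_true))  # small estimation error
```

The soft-threshold step is what sets coefficients exactly to zero, which a plain gradient step on a smooth penalty would never do.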
Concentration inequalities are the analytical backbone of high-dimensional theory. Hoeffding’s inequality bounds the tail of a sum of bounded independent random variables:

$$P\!\left(\left|\frac{1}{n}\sum_{i=1}^n \bigl(X_i - E[X_i]\bigr)\right| \ge t\right) \le 2 \exp\!\left(-\frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right),$$

where $a_i \le X_i \le b_i$ almost surely. Bernstein’s inequality sharpens this when the variance is known to be small. Sub-Gaussian and sub-exponential concentration extend these ideas to broader classes of random variables, and are essential for bounding estimation error in high dimensions.
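The bound is crude but universal, as a quick simulation shows for Bernoulli variables bounded in $[0, 1]$ (sample size and threshold are arbitrary; for $a_i = 0$, $b_i = 1$ the bound reduces to $2e^{-2nt^2}$):

```python
import numpy as np

rng = np.random.default_rng(11)
n, t, reps = 100, 0.1, 100000

# Tail probability of the mean of n Bernoulli(1/2) variables (bounded
# in [0, 1]) versus the Hoeffding bound 2 * exp(-2 * n * t^2).
successes = rng.binomial(n, 0.5, size=reps)
tail = np.mean(np.abs(successes - n * 0.5) >= n * t)  # |mean - 1/2| >= t
bound = 2 * np.exp(-2 * n * t ** 2)

print(tail, bound)   # empirical tail is well below the bound
```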
The behavior of the sample covariance matrix in high dimensions is qualitatively different from the classical regime. When $p/n \to \gamma \in (0, \infty)$, the empirical spectral distribution of the sample covariance converges to the Marchenko-Pastur law — a deterministic distribution that depends only on $\gamma$ and the population spectrum. This random matrix theory result, due to Marchenko and Pastur (1967), explains why classical principal component analysis misbehaves in high dimensions: the sample eigenvalues are systematically biased away from the population eigenvalues, and PCA can detect a signal only when the population eigenvalue exceeds the so-called BBP threshold (Baik, Ben Arous, and Péché, 2005).
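The spectral spreading is visible in a few lines: for pure-noise data with $\gamma = p/n = 1/2$ and population covariance $I$, the sample covariance eigenvalues fill the Marchenko-Pastur support $[(1-\sqrt{\gamma})^2,\,(1+\sqrt{\gamma})^2]$ rather than concentrating at the true value 1:

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 2000, 1000   # aspect ratio gamma = p/n = 0.5

# Eigenvalues of the sample covariance of white noise do NOT
# concentrate at 1: they spread over the Marchenko-Pastur support
# [(1 - sqrt(gamma))^2, (1 + sqrt(gamma))^2].
X = rng.normal(size=(n, p))
S = X.T @ X / n
eig = np.linalg.eigvalsh(S)

gamma = p / n
edges = ((1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2)
print(eig.min(), eig.max())   # near the edges (about 0.086 and 2.914)
print(edges)
```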
High-dimensional statistics connects naturally to machine learning theory, information theory, and computational complexity. The phenomenon of phase transitions — sharp boundaries in parameter space between regimes where estimation is possible and where it is not — has been identified in problems ranging from sparse signal recovery (compressed sensing) to community detection in networks. These phase transitions arise from fundamental information-theoretic limits, computed via Fano’s inequality and related tools, that no estimator can overcome regardless of computational cost. The interplay between statistical and computational barriers — what is information-theoretically possible versus what can be achieved with polynomial-time algorithms — is one of the most active frontiers in the field.