Probability Theory

Measure-theoretic foundations, limit theorems, and stochastic processes.


Probability theory is the mathematical language of uncertainty — the branch of mathematics that gives rigorous meaning to the intuitive notion that some outcomes are more likely than others. Built on the foundations of measure theory, it transforms vague talk of chance into a precise, powerful framework capable of handling everything from coin flips to the fluctuations of financial markets. The modern measure-theoretic approach, developed by Andrei Kolmogorov in his 1933 monograph Grundbegriffe der Wahrscheinlichkeitsrechnung, unified a century of fragmented probabilistic thinking and opened the door to the rich world of stochastic processes we inhabit today.

Probability Spaces and Measure Theory

The starting point of rigorous probability is the construction of a probability space, a triple $(\Omega, \mathcal{F}, P)$ that encodes everything about a random experiment. The sample space $\Omega$ is the set of all possible outcomes — for a coin flip, $\Omega = \{H, T\}$; for a random real number in $[0,1]$, $\Omega = [0,1]$. The sigma-algebra $\mathcal{F}$ is a collection of subsets of $\Omega$ called events, closed under complementation and countable unions, which represents the class of questions we are allowed to ask about our experiment. The probability measure $P: \mathcal{F} \to [0,1]$ assigns a probability to each event, satisfying Kolmogorov’s axioms: $P(\Omega) = 1$ and countable additivity, meaning $P\left(\bigcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty P(A_n)$ for pairwise disjoint events $A_n$.

The requirement of a sigma-algebra is not a pedantic technicality — it reflects a genuine mathematical necessity. When $\Omega$ is uncountable (as in any continuous model), not every subset can be assigned a probability in a consistent way; the existence of non-measurable sets, established by Giuseppe Vitali in 1905 and later dramatized by the Banach-Tarski paradox, shows that naive set-theoretic probability leads to contradictions. The sigma-algebra $\mathcal{F}$ is precisely the universe of well-behaved subsets we agree to work with. For models on the real line, the standard choice is the Borel sigma-algebra $\mathcal{B}(\mathbb{R})$, generated by all open intervals — a vast but tractable collection.

The Carathéodory extension theorem is the engine behind probability space construction. It says that to define a probability measure on all of $\mathcal{F}$, it suffices to define a consistent pre-measure on a simpler generating class — such as the collection of all half-open intervals $[a,b)$ — and Carathéodory’s machinery guarantees a unique extension to the full sigma-algebra. This is how the uniform measure on $[0,1]$, and more generally Lebesgue measure, is rigorously constructed. The theorem likewise underpins the Kolmogorov extension theorem, which constructs probability measures on infinite product spaces. Given a consistent family of finite-dimensional distributions, the Kolmogorov extension theorem guarantees the existence of a single probability measure on $\mathbb{R}^\infty$ (or any infinite product of Polish spaces) that has all specified marginals — this is the foundational guarantee behind the existence of stochastic processes.

Random Variables and Distributions

A random variable is not a variable in the algebraic sense, nor is it random in the colloquial sense: it is a measurable function $X: \Omega \to \mathbb{R}$, meaning that for every Borel set $B \subseteq \mathbb{R}$, the preimage $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$ belongs to $\mathcal{F}$. This measurability condition ensures that statements like "$X \leq 3$" or "$X \in [1,2]$" are genuine events with well-defined probabilities. The random variable $X$ induces a probability measure $\mu_X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, called its distribution or law, defined by $\mu_X(B) = P(X^{-1}(B))$.

The distribution of a random variable is completely characterized by its cumulative distribution function (CDF) $F_X(x) = P(X \leq x)$. Every CDF is right-continuous, non-decreasing, and satisfies $F_X(-\infty) = 0$ and $F_X(+\infty) = 1$. Conversely, every function with these three properties is the CDF of some random variable. The Lebesgue decomposition theorem tells us that any distribution decomposes uniquely into a discrete part (a countable weighted sum of point masses), an absolutely continuous part (specified by a probability density function $f$ satisfying $P(X \in B) = \int_B f(x)\,dx$), and a singular continuous part that has no atoms yet concentrates all its mass on a set of Lebesgue measure zero — exemplified by the exotic Cantor distribution.

Among the most important families: the Gaussian (or normal) distribution $N(\mu, \sigma^2)$ with density $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ is the cornerstone of probability, arising naturally as a limit in the central limit theorem and appearing throughout physics, statistics, and finance. The Poisson distribution with parameter $\lambda > 0$ gives $P(X = k) = e^{-\lambda}\lambda^k / k!$ and models the number of rare events in a fixed period. The exponential distribution with rate $\lambda$ has density $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$ and is the unique continuous distribution with the memoryless property $P(X > s+t \mid X > s) = P(X > t)$.
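The memoryless property is easy to check by simulation. A minimal sketch (the rate `lam` and the thresholds `s`, `t` are arbitrary choices):

```python
import math
import random

random.seed(0)
lam, n = 1.5, 200_000
samples = [random.expovariate(lam) for _ in range(n)]

s, t = 0.4, 0.7
survivors = [x for x in samples if x > s]            # condition on X > s
cond = sum(x > s + t for x in survivors) / len(survivors)
uncond = sum(x > t for x in samples) / n             # P(X > t)

# Both estimates should match the exact tail probability e^{-lam * t}
print(round(cond, 3), round(uncond, 3), round(math.exp(-lam * t), 3))
```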

Expectation and Integration

The expected value (or mathematical expectation) of a random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is defined as a Lebesgue integral:

$$E[X] = \int_\Omega X(\omega)\,dP(\omega).$$

This unifies the classical formulas: for a discrete random variable, $E[X] = \sum_k k \cdot P(X = k)$; for a continuous one with density $f$, $E[X] = \int_{-\infty}^\infty x f(x)\,dx$. The integration is defined in stages — first for non-negative simple functions (finite weighted sums of indicator functions), then for general non-negative measurable functions via approximation, then for integrable functions by writing $X = X^+ - X^-$, where $X^+ = \max(X,0)$ and $X^- = \max(-X,0)$.
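The staged construction can be watched numerically: approximating a non-negative random variable by the standard dyadic simple functions drives the expectation up to its true value. A sketch on $\Omega = [0,1]$ with $X(\omega) = \sqrt{\omega}$, so that $E[X] = 2/3$ (the Monte Carlo sample size is an arbitrary choice):

```python
import random

random.seed(1)
n = 100_000
omegas = [random.random() for _ in range(n)]   # Omega = [0,1] with uniform P
X = [w ** 0.5 for w in omegas]                 # X(omega) = sqrt(omega), E[X] = 2/3

def simple_approx(x, k):
    # Dyadic simple function: round x down to a multiple of 2^-k, capped at k.
    # This is the standard staircase used to build the Lebesgue integral.
    return min(int(x * 2 ** k) / 2 ** k, k)

means = [sum(simple_approx(x, k) for x in X) / n for k in (1, 3, 6, 10)]
print([round(m, 4) for m in means])  # non-decreasing, approaching 2/3
```

The monotone climb of the approximations toward $E[X]$ is exactly the picture behind the Monotone Convergence Theorem discussed next.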

Three convergence theorems are the workhorses of the theory. The Monotone Convergence Theorem (Levi, 1906) states that if $X_n \nearrow X$ pointwise with $X_n \geq 0$, then $E[X_n] \to E[X]$ — one may exchange limit and integral when the sequence is non-decreasing. Fatou’s Lemma provides a one-sided bound for general sequences: $E[\liminf_n X_n] \leq \liminf_n E[X_n]$. The crown jewel is the Dominated Convergence Theorem (Lebesgue): if $X_n \to X$ pointwise and $|X_n| \leq Y$ for some integrable dominating function $Y$, then $E[X_n] \to E[X]$. These theorems underlie virtually every interchange of limit and expectation in probability theory.

The variance of $X$ measures the spread of its distribution: $\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$. The fundamental Chebyshev inequality $P(|X - \mu| \geq t) \leq \sigma^2/t^2$ (where $\mu = E[X]$ and $\sigma^2 = \text{Var}(X)$) bounds the probability of deviating far from the mean using only mean and variance, requiring no knowledge of the distribution’s shape. A complementary tool is Jensen’s inequality: if $\varphi$ is convex, then $\varphi(E[X]) \leq E[\varphi(X)]$. This single inequality simultaneously implies the AM-GM inequality, the Cauchy-Schwarz inequality for expectations, and the non-negativity of KL-divergence in information theory.
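Both inequalities can be checked empirically. A sketch (the choice of an Exp(1) sample and the thresholds are arbitrary; note that Chebyshev holds exactly for the empirical distribution with its own mean and variance, so the check is deterministic):

```python
import random

random.seed(2)
n = 100_000
xs = [random.expovariate(1.0) for _ in range(n)]   # Exp(1): mean 1, variance 1

mu = sum(xs) / n
var = sum((x - mu) ** 2 for x in xs) / n

# Chebyshev: P(|X - mu| >= t) <= var / t^2, checked at several thresholds
cheb = {t: sum(abs(x - mu) >= t for x in xs) / n for t in (1.5, 2.0, 3.0)}

# Jensen with the convex function phi(x) = x^2: phi(E[X]) <= E[phi(X)]
lhs = mu ** 2
rhs = sum(x * x for x in xs) / n
print(cheb, round(lhs, 3), round(rhs, 3))
```

The gap `rhs - lhs` is exactly the sample variance, which is one way to see why Jensen's inequality is strict for any non-degenerate distribution and a strictly convex $\varphi$.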

Conditional Expectation and Martingales

Conditional expectation is one of the deepest and most powerful concepts in measure-theoretic probability. Given a random variable $X \in L^1(\Omega, \mathcal{F}, P)$ and a sub-sigma-algebra $\mathcal{G} \subseteq \mathcal{F}$, the conditional expectation $E[X \mid \mathcal{G}]$ is the (almost surely unique) $\mathcal{G}$-measurable random variable $Z$ such that $\int_G Z\,dP = \int_G X\,dP$ for all $G \in \mathcal{G}$. Its existence is guaranteed by the Radon-Nikodym theorem: $Z$ is the Radon-Nikodym derivative of the (in general signed) measure $A \mapsto \int_A X\,dP$, restricted to $\mathcal{G}$, with respect to the restriction of $P$ to $\mathcal{G}$. For $X \in L^2$, conditional expectation is best understood as an orthogonal projection in the Hilbert space $L^2(\Omega, \mathcal{F}, P)$: it projects $X$ onto the closed subspace of $\mathcal{G}$-measurable functions, minimizing $E[(X-Z)^2]$ over all $\mathcal{G}$-measurable $Z$.
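The projection picture can be made concrete: when $\mathcal{G} = \sigma(G)$ for a discrete $G$, the $\mathcal{G}$-measurable functions are exactly the functions of $G$, and the group-wise means minimize the mean squared error among them. A sketch with a hypothetical model $X = G^2 + \text{noise}$ (the distribution of $G$ and the noise are arbitrary choices):

```python
import random

random.seed(3)
n = 50_000
data = []
for _ in range(n):
    g = random.randrange(3)                            # G uniform on {0, 1, 2}
    data.append((g, g * g + random.gauss(0.0, 1.0)))   # E[X | G] = G^2

# A sigma(G)-measurable candidate is just a triple of values, one per group;
# the conditional expectation is the vector of group means
group_means = []
for g in range(3):
    grp = [x for h, x in data if h == g]
    group_means.append(sum(grp) / len(grp))

def mse(z):
    # Empirical E[(X - Z)^2] for the G-measurable candidate Z = z[G]
    return sum((x - z[g]) ** 2 for g, x in data) / n

best = mse(group_means)
other = mse([0.5, 0.5, 3.0])   # any other function of G does worse
print([round(m, 2) for m in group_means], round(best, 3), round(other, 3))
```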

The tower property (or law of total expectation) is the most used property: if $\mathcal{H} \subseteq \mathcal{G}$, then $E[E[X \mid \mathcal{G}] \mid \mathcal{H}] = E[X \mid \mathcal{H}]$. In words, iterated conditioning collapses to conditioning on the coarser information. The law of total variance $\text{Var}(X) = E[\text{Var}(X \mid \mathcal{G})] + \text{Var}(E[X \mid \mathcal{G}])$ decomposes variance into within-group and between-group components, a formula that underlies ANOVA in statistics and the bias-variance tradeoff in machine learning.
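For a finite sample this decomposition is an exact algebraic identity (the ANOVA sum-of-squares identity), so it can be verified to machine precision. A sketch with an arbitrary three-group Gaussian mixture:

```python
import random

random.seed(4)
n = 60_000
means, sds = [0.0, 2.0, 5.0], [1.0, 0.5, 2.0]   # arbitrary group parameters

data = []
for _ in range(n):
    g = random.randrange(3)
    data.append((g, random.gauss(means[g], sds[g])))

xs = [x for _, x in data]
mu = sum(xs) / n
total = sum((x - mu) ** 2 for x in xs) / n       # Var(X)

within = between = 0.0
for g in range(3):
    grp = [x for h, x in data if h == g]
    w, gm = len(grp) / n, sum(grp) / len(grp)
    within += w * sum((x - gm) ** 2 for x in grp) / len(grp)  # E[Var(X | G)]
    between += w * (gm - mu) ** 2                             # Var(E[X | G])

print(round(total, 4), round(within + between, 4))   # the two agree
```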

A martingale is a stochastic process $(M_n)_{n \geq 0}$ adapted to a filtration $(\mathcal{F}_n)$ — an increasing sequence of sigma-algebras representing the information available at each time — satisfying $E[M_{n+1} \mid \mathcal{F}_n] = M_n$ for all $n$. This says the best prediction of tomorrow’s value, given everything known today, is today’s value: martingales model fair games. The concept was formalized in the 1930s by Paul Lévy and later systematized by Joseph Doob, whose 1953 treatise Stochastic Processes is a landmark of twentieth-century mathematics. A submartingale satisfies $E[M_{n+1} \mid \mathcal{F}_n] \geq M_n$ (favorable game); a supermartingale satisfies $\leq$ (unfavorable).

Doob’s optional stopping theorem is a central result: under suitable integrability conditions, for a stopping time $\tau$ (a random time determined by the filtration), $E[M_\tau] = E[M_0]$. This makes precise the intuition that no gambling strategy can beat a fair game. Doob’s martingale convergence theorem states that every non-negative supermartingale converges almost surely to an integrable limit — a profound result whose proof uses the elegant upcrossing inequality.
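Optional stopping is easy to test on a symmetric random walk (a martingale) stopped on exiting an interval; this stopping time has finite expectation, so the theorem applies. The barriers $-3$ and $5$ below are arbitrary choices, and gambler's-ruin reasoning predicts the upper barrier is hit with probability $3/8$:

```python
import random

random.seed(5)

def stopped_walk(a=3, b=5):
    # Symmetric +/-1 walk M_n, stopped at tau = first exit from (-a, b)
    m = 0
    while -a < m < b:
        m += random.choice((-1, 1))
    return m

n = 50_000
finals = [stopped_walk() for _ in range(n)]
mean_stopped = sum(finals) / n                  # should be E[M_0] = 0
p_upper = sum(v == 5 for v in finals) / n       # gambler's ruin: a/(a+b) = 3/8
print(round(mean_stopped, 3), round(p_upper, 3))
```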

Independence and Limit Theorems

Two events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$. More generally, a family of sub-sigma-algebras $\{\mathcal{G}_i\}_{i \in I}$ is independent if for any finite subcollection $\mathcal{G}_{i_1}, \ldots, \mathcal{G}_{i_k}$ and events $G_j \in \mathcal{G}_{i_j}$,

$$P(G_1 \cap \cdots \cap G_k) = P(G_1) \cdots P(G_k).$$

Random variables are independent if the sigma-algebras they generate are independent; equivalently, if their joint distribution is the product of their marginals. Note that pairwise independence does not imply mutual independence — this distinction, sometimes surprising to newcomers, is a recurring source of subtlety throughout probability.
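The classic counterexample is small enough to check by exhaustive enumeration: two fair coin flips $X, Y$ together with $Z = X \oplus Y$ are pairwise independent, yet not mutually independent.

```python
from itertools import product

# X, Y fair coin flips, Z = X XOR Y; the four (X, Y) outcomes are equally likely
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

def p(event):
    # Probability of an event under the uniform measure on the four outcomes
    return sum(event(o) for o in outcomes) / len(outcomes)

px = p(lambda o: o[0] == 1)
py = p(lambda o: o[1] == 1)
pz = p(lambda o: o[2] == 1)
pxz = p(lambda o: o[0] == 1 and o[2] == 1)
pxyz = p(lambda o: o[0] == 1 and o[1] == 1 and o[2] == 1)

print(pxz == px * pz)        # True: pairwise independence (likewise for the other pairs)
print(pxyz == px * py * pz)  # False: mutual independence fails, 0.0 != 0.125
```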

The Borel-Cantelli lemmas give precise conditions for infinitely many events to occur. The first lemma states: if $\sum_n P(A_n) < \infty$, then $P(\limsup_n A_n) = 0$ — almost surely, only finitely many $A_n$ occur. The second (requiring independence) states: if the $A_n$ are independent and $\sum_n P(A_n) = \infty$, then $P(\limsup_n A_n) = 1$ — infinitely many occur almost surely. Kolmogorov’s 0-1 law is a striking consequence of independence: every event in the tail sigma-algebra $\mathcal{T} = \bigcap_{n=1}^\infty \sigma(X_{n+1}, X_{n+2}, \ldots)$ of an independent sequence has probability zero or one. Events like “the series $\sum X_n$ converges” or “$\limsup X_n / \log n$ is finite” are tail events and therefore have trivial probabilities.

The Weak Law of Large Numbers (Khintchine, 1929) states that for independent identically distributed random variables $X_1, X_2, \ldots$ with finite mean $\mu$, the sample average converges in probability: $\bar{X}_n = \frac{1}{n}\sum_{k=1}^n X_k \xrightarrow{P} \mu$. The Strong Law of Large Numbers strengthens this to almost sure convergence: $\bar{X}_n \xrightarrow{a.s.} \mu$, meaning the exceptional set of outcomes where convergence fails has probability zero. The standard proof proceeds via Kolmogorov’s maximal inequality and a truncation argument; the SLLN requires only finite mean (unlike some weaker forms that require finite variance).
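A quick simulation shows the sample average settling toward the mean. A sketch with fair-coin flips (the checkpoint sizes are arbitrary):

```python
import random

random.seed(6)
checkpoints = (100, 10_000, 1_000_000)
running, averages = 0, {}
for k in range(1, checkpoints[-1] + 1):
    running += random.random() < 0.5   # one Bernoulli(1/2) flip
    if k in checkpoints:
        averages[k] = running / k
print(averages)  # the averages drift toward mu = 0.5 as n grows
```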

The Central Limit Theorem (CLT) is the deepest of the classical limit theorems. For i.i.d. random variables with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$,

$$\frac{1}{\sigma\sqrt{n}}\sum_{k=1}^n (X_k - \mu) \xrightarrow{d} N(0,1).$$

The distribution of the standardized sum converges to the standard Gaussian as $n \to \infty$, regardless of the underlying distribution of the $X_k$. The earliest versions were proved by Abraham de Moivre (1733, for Bernoulli variables) and Pierre-Simon Laplace (1812, for sums of many small errors). The modern form was established by Aleksandr Lyapunov (1901) and refined through the Lindeberg condition (1922), which gives the sharp criterion for the CLT to hold for triangular arrays of independent (but not necessarily identically distributed) random variables: no single term should dominate the sum. Berry-Esseen bounds make the convergence quantitative: for centered i.i.d. variables, $|F_n(x) - \Phi(x)| \leq C \rho / (\sigma^3 \sqrt{n})$, where $\rho = E[|X|^3]$.
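The CLT is easy to watch in simulation: standardized sums of uniforms (mean $1/2$, variance $1/12$) already look Gaussian at $n = 30$. A sketch comparing one empirical probability with $\Phi(1)$ (the values of $n$ and the evaluation point are arbitrary choices):

```python
import math
import random

random.seed(7)
n, reps = 30, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)   # mean and sd of Uniform(0,1)

# Standardized sums: (S_n - n*mu) / (sigma * sqrt(n))
zs = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * math.sqrt(n))
      for _ in range(reps)]

phi1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))   # Phi(1), about 0.8413
emp = sum(z <= 1 for z in zs) / reps            # empirical P(Z <= 1)
print(round(emp, 3), round(phi1, 3))
```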

Characteristic Functions and Transforms

The characteristic function of a random variable $X$ is defined by $\varphi_X(t) = E[e^{itX}]$ for $t \in \mathbb{R}$, where $i = \sqrt{-1}$. It is the Fourier transform of the distribution $\mu_X$. Every characteristic function is uniformly continuous, bounded by 1, and satisfies $\varphi_X(0) = 1$. The fundamental result is the uniqueness theorem: two distributions are equal if and only if their characteristic functions agree. This allows one to identify distributions by computing a single function rather than comparing measures on all Borel sets.

The Lévy continuity theorem (or Lévy-Cramér theorem) is the key tool for proving convergence in distribution: a sequence of distributions $\mu_n$ converges weakly to $\mu$ if and only if the corresponding characteristic functions $\varphi_n(t)$ converge pointwise to $\varphi(t)$, and the limit $\varphi$ is continuous at 0. This reduces the CLT proof to a computation: for i.i.d. variables with $\mu = 0$ and $\sigma^2 = 1$, one expands $\varphi_X(t/\sqrt{n})^n \approx (1 - t^2/(2n))^n \to e^{-t^2/2}$, which is exactly the characteristic function of $N(0,1)$.
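The expansion at the heart of this proof can be checked numerically. For Rademacher variables ($\pm 1$ with probability $1/2$ each), $\varphi_X(t) = \cos t$, and $\cos(t/\sqrt{n})^n$ converges to $e^{-t^2/2}$ (the evaluation point is an arbitrary choice):

```python
import math

t = 1.7                             # arbitrary evaluation point
target = math.exp(-t ** 2 / 2)      # characteristic function of N(0,1) at t
vals = {n: math.cos(t / math.sqrt(n)) ** n for n in (10, 100, 10_000)}
for n, v in vals.items():
    print(n, round(v, 5), round(target, 5))   # v approaches target as n grows
```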

The moment generating function (MGF) $M_X(t) = E[e^{tX}]$ exists when $E[e^{tX}] < \infty$ in a neighborhood of zero, and when it exists it uniquely determines the distribution and satisfies $E[X^k] = M_X^{(k)}(0)$ — hence “moment generating.” The cumulant generating function $K_X(t) = \log M_X(t)$ has the property that its Taylor coefficients, the cumulants $\kappa_n$, satisfy $\kappa_1 = E[X]$ (mean), $\kappa_2 = \text{Var}(X)$, and higher cumulants measure departures from normality (skewness is related to $\kappa_3$, kurtosis to $\kappa_4$). For independent random variables, cumulants add: $K_{X+Y}(t) = K_X(t) + K_Y(t)$, which gives an elegant way to track how distributions evolve under convolution. The Laplace transform $\mathcal{L}_X(s) = E[e^{-sX}]$ plays a similar role for non-negative random variables and is the primary tool in renewal theory and queuing theory.

Brownian Motion and Stochastic Processes

Brownian motion (also called the Wiener process) is the central object of continuous-time probability theory. A standard Brownian motion $(B_t)_{t \geq 0}$ is a stochastic process characterized by four properties: $B_0 = 0$ almost surely; independent increments — for $0 \leq s < t$, the increment $B_t - B_s$ is independent of $\mathcal{F}_s$; stationary Gaussian increments — $B_t - B_s \sim N(0, t-s)$; and continuous paths — almost surely, the function $t \mapsto B_t$ is continuous. The process is named after botanist Robert Brown, who observed the erratic motion of pollen particles in water in 1827, but its mathematical existence was established by Norbert Wiener in 1923. Bachelier’s 1900 thesis Théorie de la Spéculation had already used Brownian motion to model stock prices — the first mathematical model of a financial market, decades before Wiener’s rigorous construction.

The existence of Brownian motion is non-trivial: paths are required to be continuous, yet the increments are Gaussian — how do we know there is a probability space supporting such an object? Wiener’s construction uses Fourier series; a more elegant approach uses the Kolmogorov extension theorem to define consistent finite-dimensional Gaussian distributions, then invokes the Kolmogorov continuity theorem (if $E[|X_t - X_s|^\alpha] \leq C|t-s|^{1+\beta}$ for some $\alpha, \beta, C > 0$, then the process has a continuous modification) to obtain a version with continuous paths.

Brownian motion has remarkable and paradoxical path properties. Almost surely, Brownian paths are nowhere differentiable — they oscillate so wildly that no tangent line exists at any point, a fact proved by Paley, Wiener, and Zygmund in 1933. Yet they are Hölder continuous of every exponent less than $1/2$: for every $\gamma < 1/2$, there exists $C(\omega) > 0$ such that $|B_t - B_s| \leq C|t-s|^\gamma$ for all $s,t$ in a compact interval. The quadratic variation of Brownian motion is the central fact distinguishing stochastic calculus from ordinary calculus: along any sequence of partitions whose mesh tends to zero, $\sum_i (B_{t_{i+1}} - B_{t_i})^2 \to t$ in probability. This is written $[B,B]_t = t$, and it means that Brownian motion accumulates quadratic variation at a rate of one per unit time — its paths have finite quadratic variation but unbounded first variation.
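Both facts — quadratic variation equal to $t$ and exploding first variation — show up immediately on a sampled path. A sketch (the step count is an arbitrary choice):

```python
import math
import random

random.seed(8)
t, steps = 1.0, 100_000
dt = t / steps
# Brownian increments over a uniform partition of [0, t]
inc = [random.gauss(0.0, math.sqrt(dt)) for _ in range(steps)]

qv = sum(d * d for d in inc)    # quadratic variation: close to t = 1
fv = sum(abs(d) for d in inc)   # first variation: grows like sqrt(steps), no limit
print(round(qv, 3), round(fv, 1))
```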

Brownian motion is simultaneously a martingale (its increments are mean-zero and independent of the past), a Markov process, and a Gaussian process. It satisfies the reflection principle: the distribution of $\max_{0 \leq s \leq t} B_s$ equals the distribution of $|B_t|$, so $P(\max_{s \leq t} B_s \geq x) = 2P(B_t \geq x)$ for $x > 0$. Among the more surprising distributional facts is Lévy’s arcsine law (1939): the fraction of time Brownian motion spends positive on $[0,1]$ has an arcsine distribution on $[0,1]$, not the uniform distribution one might naively expect.
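The reflection principle can be verified on discretized paths; the discrete-time maximum slightly undershoots the continuous one, so only rough agreement should be expected. A sketch (the level, step count, and replication count are arbitrary choices):

```python
import math
import random

random.seed(9)
steps, reps, x = 500, 10_000, 0.8
dt = 1.0 / steps

hits_max = hits_end = 0
for _ in range(reps):
    b = running_max = 0.0
    for _ in range(steps):
        b += random.gauss(0.0, math.sqrt(dt))
        running_max = max(running_max, b)
    hits_max += running_max >= x
    hits_end += b >= x

p_max = hits_max / reps
p_end = hits_end / reps
print(round(p_max, 3), round(2 * p_end, 3))   # reflection: P(max >= x) = 2 P(B_1 >= x)
```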

Stochastic Integration and SDEs

The pathwise irregularity of Brownian motion prevents defining $\int_0^t f(s)\,dB_s$ as a Riemann-Stieltjes integral — Brownian paths have infinite first variation, so the classical theory does not apply. Kiyosi Itô resolved this in 1944 by defining a new integral based on the $L^2$ structure of the process rather than pathwise approximation. For a predictable (or adapted, left-continuous) process $H$, the Itô integral $\int_0^t H_s\,dB_s$ is first defined for simple processes (step functions in time) and then extended to all square-integrable adapted processes using the Itô isometry:

$$E\left[\left(\int_0^t H_s\,dB_s\right)^2\right] = E\left[\int_0^t H_s^2\,ds\right].$$

This isometry identifies the stochastic integral as a natural isometry from $L^2([0,t] \times \Omega)$ to $L^2(\Omega)$, making it a well-defined element of a Hilbert space. The resulting integral $\int_0^t H_s\,dB_s$ is itself a continuous martingale.

Itô’s formula (or Itô’s lemma) is the change-of-variables rule for stochastic calculus. If $f: \mathbb{R} \to \mathbb{R}$ is twice continuously differentiable and $X_t = X_0 + \int_0^t b_s\,ds + \int_0^t \sigma_s\,dB_s$ is an Itô process, then:

$$f(X_t) = f(X_0) + \int_0^t f'(X_s)\,dX_s + \frac{1}{2}\int_0^t f''(X_s)\,d[X,X]_s.$$

The extra second-order term $\frac{1}{2}f''(X_s)\sigma_s^2\,ds$ — absent in ordinary calculus — arises precisely from the non-zero quadratic variation of $X$. This term is the mathematical source of the “convexity correction” in finance and of the Laplacian term in the connection between Brownian motion and PDEs.
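Itô's formula can be checked pathwise on the simplest nontrivial case, $f(x) = x^2$ with $X = B$: the formula reads $B_t^2 = 2\int_0^t B_s\,dB_s + t$, and a left-endpoint (Itô) Riemann sum reproduces it up to discretization error. A sketch (the step count is an arbitrary choice):

```python
import math
import random

random.seed(10)
t, steps = 1.0, 20_000
dt = t / steps

b = ito_sum = 0.0
for _ in range(steps):
    db = random.gauss(0.0, math.sqrt(dt))
    ito_sum += b * db      # left-endpoint evaluation: the Ito convention
    b += db

lhs = b * b                # f(B_t) = B_t^2
rhs = 2 * ito_sum + t      # Ito's formula: 2 * int B dB + [B, B]_t
print(round(lhs, 3), round(rhs, 3))   # the two sides agree up to O(1/sqrt(steps))
```

Using the right endpoint instead of the left would converge to the Stratonovich value and break the identity, which is one way to see that the evaluation point genuinely matters in stochastic integration.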

A stochastic differential equation (SDE) takes the form $dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dB_t$, which is shorthand for the integral equation $X_t = X_0 + \int_0^t b(s,X_s)\,ds + \int_0^t \sigma(s,X_s)\,dB_s$. Under Lipschitz and linear growth conditions on $b$ and $\sigma$, Picard iteration guarantees the existence of a unique strong solution — a continuous, adapted process solving the equation pathwise. SDEs model diffusion processes: geometric Brownian motion $dS_t = \mu S_t\,dt + \sigma S_t\,dB_t$ (Black-Scholes model), the Ornstein-Uhlenbeck process $dX_t = -\theta X_t\,dt + \sigma\,dB_t$ (mean-reverting noise), and the CIR and Heston models of interest rate and volatility dynamics. The Girsanov theorem — which says that a Brownian motion with drift $\mu_t$ can be transformed into a standard Brownian motion by an explicit change of measure whose Radon-Nikodym derivative is $\exp\left(-\int_0^T \mu_t\,dB_t - \frac{1}{2}\int_0^T \mu_t^2\,dt\right)$, with $B$ the driving Brownian motion — is the foundation of risk-neutral pricing in mathematical finance and of the Feynman-Kac formula connecting SDEs to parabolic PDEs.
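In practice SDE solutions are approximated numerically; the Euler-Maruyama scheme (a standard method, not discussed above) simply discretizes the integral equation step by step. A sketch for geometric Brownian motion, whose exact mean $E[S_t] = S_0 e^{\mu t}$ provides a check (all parameter values are arbitrary choices):

```python
import math
import random

random.seed(11)

def euler_maruyama_gbm(s0=1.0, mu=0.05, sigma=0.2, t=1.0, steps=200):
    # One Euler-Maruyama path of dS = mu*S dt + sigma*S dB
    dt = t / steps
    s = s0
    for _ in range(steps):
        s += mu * s * dt + sigma * s * random.gauss(0.0, math.sqrt(dt))
    return s

reps = 20_000
mean_s1 = sum(euler_maruyama_gbm() for _ in range(reps)) / reps
exact = math.exp(0.05)      # E[S_1] = s0 * e^{mu * t}
print(round(mean_s1, 3), round(exact, 3))
```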

The development of stochastic calculus marked a qualitative shift in what probability theory could address: from the classical limit theorems of independent variables, it moved into the continuous-time dynamics of interacting random systems, opening pathways to mathematical finance, the theory of random PDEs, and the modern understanding of non-equilibrium statistical mechanics.