Information Theory

Entropy, coding theory, channel capacity, and Kolmogorov complexity.


Information theory is the mathematical science of quantifying, storing, and transmitting information. Founded in a landmark 1948 paper by Claude Shannon, it gave precise answers to questions that had seemed hopelessly vague: how much information is in a message, how efficiently data can be compressed, and how reliably signals can be sent across a noisy channel. In the decades since, the theory has reached into thermodynamics, cryptography, linguistics, machine learning, and quantum physics, making it one of the most powerful cross-disciplinary frameworks in all of applied mathematics.

Entropy and Information Measures

The central concept in information theory is Shannon entropy, a measure of the uncertainty — or, equivalently, the information content — of a probability distribution. To see why entropy is the right measure, start with a single outcome. If a source emits a symbol x with probability p, the self-information of that outcome is defined as

I(x) = -\log_2 p.

The logarithm base 2 gives the answer in bits. Outcomes that are very likely carry little surprise and therefore little information; rare outcomes carry a great deal. If p = 1, then I(x) = 0 — a certain event tells you nothing. If p = 1/8, then I(x) = 3 bits — learning that the outcome occurred is worth three binary questions.

For a discrete random variable X taking values in an alphabet \mathcal{X} with probability mass function p(x) = P(X = x), the Shannon entropy is the expected self-information:

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),

where by convention 0 \log_2 0 = 0. Entropy is non-negative, equals zero only when the outcome is certain, and achieves its maximum \log_2 |\mathcal{X}| when X is uniformly distributed. The binary entropy function h(p) = -p \log_2 p - (1-p) \log_2 (1-p) is the entropy of a biased coin with heads probability p; it peaks at h(1/2) = 1 bit and is zero at p = 0 or p = 1.
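
The entropy formula is easy to check numerically. A minimal sketch in Python (function names are my own):

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a probability mass function given as a list."""
    # Terms with pi == 0 are skipped, implementing the convention 0 log 0 = 0.
    return sum(-pi * log2(pi) for pi in p if pi > 0)

def binary_entropy(p):
    """Entropy of a biased coin with heads probability p."""
    return entropy([p, 1 - p])

print(entropy([0.25] * 4))   # uniform on 4 symbols -> 2.0 bits (the maximum, log2 4)
print(binary_entropy(0.5))   # fair coin -> 1.0 bit (the peak of h)
print(binary_entropy(1.0))   # certain outcome -> 0.0 bits
```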

The machinery of entropy extends naturally to pairs of random variables. The joint entropy of (X, Y) is H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y). The conditional entropy H(Y \mid X) measures how much uncertainty remains in Y once X is known:

H(Y \mid X) = \sum_{x} p(x) H(Y \mid X = x) = -\sum_{x,y} p(x,y) \log p(y \mid x).

These quantities obey the chain rule: H(X, Y) = H(X) + H(Y \mid X). The total uncertainty in the pair equals the uncertainty in X plus whatever uncertainty in Y remains after observing X. This decomposition extends naturally to longer chains: H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}).

The single most important derived quantity is mutual information, which measures how much knowing X reduces uncertainty about Y:

I(X; Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).

Mutual information is always non-negative (by the log-sum inequality) and symmetric: I(X; Y) = I(Y; X). It vanishes if and only if X and Y are statistically independent. This symmetry is remarkable — the information X carries about Y is exactly the information Y carries about X.
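
These identities can be verified directly on small joint distributions; a sketch (helper names are my own):

```python
from math import log2

def marginals(joint):
    """Row and column marginals of a joint pmf given as a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ], in bits."""
    px, py = marginals(joint)
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# X = Y = a fair bit: knowing X reveals Y completely, so I(X;Y) = H(Y) = 1 bit.
perfectly_correlated = {(0, 0): 0.5, (1, 1): 0.5}
# Independent fair bits: I(X;Y) = 0.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(mutual_information(perfectly_correlated))  # -> 1.0
print(mutual_information(independent))           # -> 0.0
```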

Also crucial is the Kullback-Leibler divergence (or relative entropy), which measures how much a distribution P diverges from a reference Q:

D_{KL}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.

Unlike a distance, D_{KL} is not symmetric, but it is always non-negative (Gibbs’ inequality), equaling zero only when P = Q. The KL divergence governs error exponents in hypothesis testing and is intimately connected to mutual information: I(X; Y) = D_{KL}(p(x,y) \| p(x)p(y)).
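
A quick numerical check of Gibbs’ inequality and of the asymmetry (function name is my own):

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; p and q are parallel lists of probabilities."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # heavily biased coin

print(kl_divergence(p, q))  # positive, per Gibbs' inequality
print(kl_divergence(q, p))  # a different positive value: D_KL is not symmetric
print(kl_divergence(p, p))  # 0.0: identical distributions
```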

Shannon also connected entropy to thermodynamics. The Boltzmann-Gibbs entropy S = -k_B \sum_i p_i \ln p_i from statistical mechanics has exactly the same form (with k_B Boltzmann’s constant and the natural logarithm). Shannon explicitly acknowledged this parallel; indeed, the story is told that John von Neumann advised Shannon to call his quantity “entropy” since “nobody knows what entropy really is,” so Shannon would win any argument.

Source Coding and Data Compression

Shannon’s first fundamental theorem, the source coding theorem (1948), answers the question: how compactly can data from a source be encoded, in the limit? The answer is the entropy.

To make this precise, consider a discrete memoryless source emitting i.i.d. symbols X_1, X_2, \ldots from alphabet \mathcal{X} with distribution p. The goal is to map sequences of n symbols to binary strings as short as possible while still being able to recover the original sequence. The source coding theorem states that for any \epsilon > 0 and all sufficiently large n, there exists a lossless code with average codeword length per symbol no more than H(X) + \epsilon bits, and no lossless code can achieve fewer than H(X) bits per symbol.

The proof hinges on the Asymptotic Equipartition Property (AEP). By the law of large numbers applied to the random variable -\log p(X), the quantity -\frac{1}{n} \log p(X_1, \ldots, X_n) converges in probability to H(X). This means that for large n, all but a tiny fraction of probability is concentrated on the typical set — the collection of sequences whose probability is approximately 2^{-nH(X)}. The typical set has only about 2^{nH(X)} elements, so nH(X) bits suffice to index them. Sequences outside the typical set are so rare that they can be handled with negligible overhead. This argument is Shannon’s elegant shortcut around the combinatorial complexity of all possible sequences.
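
The AEP can be watched numerically: sample longer and longer sequences and the per-symbol surprisal settles onto H(X). A sketch with an illustrative three-symbol source (the source and names are my own):

```python
import random
from math import log2

random.seed(0)
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
H = sum(-p * log2(p) for p in probs.values())  # 1.5 bits

def per_symbol_surprisal(n):
    """-(1/n) log2 p(X_1, ..., X_n) for one sequence sampled from the source."""
    seq = random.choices(list(probs), weights=list(probs.values()), k=n)
    return sum(-log2(probs[s]) for s in seq) / n

for n in (10, 100, 10_000):
    print(n, per_symbol_surprisal(n))  # converges in probability to H = 1.5
```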

For finite-length codes, the optimal algorithm in terms of average codeword length is the Huffman code, constructed greedily by repeatedly merging the two least-probable symbols. Huffman coding achieves an average length between H(X) and H(X) + 1 bits — always within one bit of the entropy. The Kraft inequality \sum_i 2^{-\ell_i} \leq 1 (where \ell_i are the codeword lengths) is the necessary and sufficient condition for a prefix code with those lengths to exist, and an optimal Huffman code satisfies it with equality.
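
The greedy merge procedure is short enough to sketch. The code below (implementation details are my own) tracks only codeword lengths, not the codewords themselves, which suffices to check the H(X) \leq \bar{\ell} < H(X) + 1 bound and the Kraft equality:

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword length per symbol for a Huffman code over a pmf {symbol: p}."""
    # Heap entries: (probability, tie-break counter, {symbol: depth-so-far}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        p1, _, l1 = heapq.heappop(heap)  # two least-probable subtrees
        p2, _, l2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**l1, **l2}.items()}  # one level deeper
        heapq.heappush(heap, (p1 + p2, tick, merged))
        tick += 1
    return heap[0][2]

probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
lengths = huffman_lengths(probs)
H = sum(-p * log2(p) for p in probs.values())
avg = sum(probs[s] * lengths[s] for s in probs)
print(lengths, H, avg)  # average length lands in [H, H + 1)
```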

When near-perfect compression is needed across long sequences, arithmetic coding performs even better: it assigns a single real number in [0,1) to an entire sequence, achieving average codeword length arbitrarily close to H(X) per symbol without the integer-length constraint that limits Huffman. The Lempel-Ziv algorithms (LZ77 and LZ78, the derivative LZW used in the GIF format, and LZ77-based DEFLATE used in ZIP) achieve the same entropy limit asymptotically without knowledge of the source distribution — they learn the source on the fly.

So far, all of this is lossless compression. In lossy compression, the goal is to represent the source within an allowable distortion D while minimizing the bit rate R. The rate-distortion function R(D) gives the minimum rate required for distortion level D, and Shannon’s rate-distortion theorem characterizes it as R(D) = \min_{p(\hat{x}|x): \mathbb{E}[d(X,\hat{X})] \leq D} I(X; \hat{X}). JPEG image compression and MP3 audio encoding are practical realizations of this principle: they sacrifice some fidelity (high-frequency detail invisible or inaudible to humans) to achieve dramatically lower bit rates.

Channel Capacity and Noisy Channel Coding

Shannon’s second fundamental theorem is even more surprising: no matter how noisy a communication channel is, as long as it transmits any information at all, reliable communication is possible at some positive rate. The maximum such rate is the channel capacity C.

A discrete memoryless channel (DMC) is described by an input alphabet \mathcal{X}, an output alphabet \mathcal{Y}, and a conditional distribution p(y \mid x) specifying the probability of receiving output y when input x is sent. Each channel use is independent of all others (hence “memoryless”). The channel capacity is:

C = \max_{p(x)} I(X; Y),

where the maximum is over all input distributions. Capacity is measured in bits per channel use.

For the binary symmetric channel (BSC) with crossover probability \epsilon (each bit is flipped independently with probability \epsilon), the capacity is C = 1 - h(\epsilon), where h is the binary entropy function. When \epsilon = 0, the channel is noiseless and C = 1 bit. When \epsilon = 1/2, the output is independent of the input and C = 0. For the additive white Gaussian noise (AWGN) channel Y = X + Z with Z \sim \mathcal{N}(0, N) and power constraint \mathbb{E}[X^2] \leq P, the famous Shannon-Hartley theorem gives:

C = \frac{1}{2} \log_2\!\left(1 + \frac{P}{N}\right) \quad \text{bits per channel use.}

This formula, often written in terms of bandwidth W as C = W \log_2(1 + P/(N_0 W)), is the foundation of all modern wireless communication design.
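
Both capacity formulas are one-liners; a sketch evaluating them at the endpoint cases discussed above (function names are my own):

```python
from math import log2

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(eps):
    """Binary symmetric channel with crossover probability eps: C = 1 - h(eps)."""
    return 1 - binary_entropy(eps)

def awgn_capacity(snr):
    """Real AWGN channel with signal-to-noise ratio P/N: C = (1/2) log2(1 + P/N)."""
    return 0.5 * log2(1 + snr)

print(bsc_capacity(0.0))   # -> 1.0: noiseless channel
print(bsc_capacity(0.5))   # -> 0.0: output independent of input
print(bsc_capacity(0.11))  # ~0.5: half a bit survives per use
print(awgn_capacity(15))   # -> 2.0 bits per use at P/N = 15
```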

The noisy-channel coding theorem states: for any rate R < C, there exist codes of length n with 2^{nR} codewords such that the maximum probability of decoding error goes to zero as n \to \infty. Conversely, for any rate R > C, the error probability is bounded away from zero for any code. Shannon’s proof was non-constructive — he showed that a randomly chosen codebook works in expectation — and it took decades to find explicit codes that approach capacity.

Fano’s inequality makes the converse direction precise: if we try to communicate at rate R > C, then any decoder must incur a probability of error P_e satisfying H(P_e) + P_e \log(|\mathcal{M}| - 1) \geq H(M \mid Y^n), where the right-hand side grows without bound as n increases. This elegant inequality links the error rate directly to the conditional entropy of the message given the received sequence, unifying the probabilistic and information-theoretic perspectives on communication limits.

Error-Correcting Codes

Shannon’s coding theorem is existential — it proves that good codes exist without building them. The field of coding theory constructs explicit codes and efficient decoding algorithms that approach the theoretical limit.

A block code of length n over an alphabet of size q is a set \mathcal{C} \subseteq \mathcal{X}^n of codewords. The code rate is R = \frac{\log_q |\mathcal{C}|}{n} symbols per channel use. Error correction relies on the Hamming distance between codewords — the number of positions in which they differ. A code with minimum Hamming distance d_{\min} can detect up to d_{\min} - 1 errors and correct up to \lfloor (d_{\min}-1)/2 \rfloor errors.

Linear codes over a finite field \mathbb{F}_q are subspaces of \mathbb{F}_q^n; they are described by a generator matrix G (whose rows span the code) or equivalently by a parity-check matrix H (whose null space is the code). A linear [n, k, d] code encodes k message symbols into n channel symbols with minimum distance d. The Hamming codes — constructed by Richard Hamming in 1950 while he was frustrated by errors on punch-card computers — are perfect linear codes: every received word is within distance 1 of exactly one codeword, so single-bit errors are always correctable and no redundancy is wasted. The Hamming [7,4,3] code adds 3 parity bits to 4 message bits, enabling single-error correction.
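
A sketch of the [7,4,3] code in systematic form. The particular generator/parity-check pair below is one standard convention, not the only one:

```python
G = [  # generator matrix [I_4 | P]: codeword = message * G over GF(2)
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H = [  # parity-check matrix [P^T | I_3]: H * codeword^T = 0 for valid codewords
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def encode(msg):
    """Encode 4 message bits into a 7-bit codeword."""
    return [sum(m * g for m, g in zip(msg, col)) % 2 for col in zip(*G)]

def decode(word):
    """Correct up to one flipped bit, then return the 4 message bits."""
    syndrome = tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)
    if any(syndrome):
        # A nonzero syndrome equals the column of H at the error position.
        pos = [tuple(col) for col in zip(*H)].index(syndrome)
        word = word[:]
        word[pos] ^= 1
    return word[:4]  # systematic code: the first 4 bits carry the message

msg = [1, 0, 1, 1]
corrupted = encode(msg)
corrupted[5] ^= 1         # flip one bit in transit
print(decode(corrupted))  # recovers [1, 0, 1, 1]
```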

Reed-Solomon codes, introduced by Irving Reed and Gustave Solomon in 1960, are codes over large alphabets (\mathbb{F}_{2^m}) whose codewords are evaluations of low-degree polynomials. An [n, k, n-k+1] Reed-Solomon code achieves the Singleton bound d \leq n - k + 1 with equality — it is a maximum distance separable (MDS) code, packing the maximum possible distance into a given rate. Reed-Solomon codes are deployed universally: they protect data on CDs and DVDs (correcting burst errors from scratches), safeguard data in RAID storage, and underlie QR codes. The algebraic structure that makes them powerful — polynomial interpolation over finite fields — also makes decoding tractable via the Berlekamp-Massey algorithm.

Modern coding theory shifted dramatically in the 1990s with the discovery of turbo codes (Berrou, Glavieux, and Thitimajshima, 1993) and the rediscovery of low-density parity-check (LDPC) codes (originally due to Gallager in the early 1960s, but impractical then). Both achieve rates within fractions of a decibel of the Shannon limit. LDPC codes are defined by sparse parity-check matrices represented as bipartite Tanner graphs, and they are decoded by belief propagation — an iterative message-passing algorithm on the graph. More recently, polar codes (Arikan, 2009) provably achieve capacity under low-complexity successive-cancellation decoding and are used in the 5G wireless standard, marking the first time a capacity-achieving code made it from theory to global deployment.

Network Information Theory

Classical information theory treats point-to-point communication: one sender, one receiver. Network information theory extends the framework to multi-terminal settings, where multiple senders and receivers share a common medium, and the results are far richer — and far less complete.

The multiple access channel (MAC) models many senders transmitting to a single receiver, as in a cellular uplink. The fundamental object is the capacity region — the set of simultaneously achievable rate pairs (R_1, R_2). For the two-user Gaussian MAC, this is a pentagon characterized by the constraints R_1 \leq I(X_1; Y \mid X_2), R_2 \leq I(X_2; Y \mid X_1), and R_1 + R_2 \leq I(X_1, X_2; Y). The sum-rate constraint shows that the two users compete for the same channel resource, but the corner points of the capacity region are achieved by successive interference cancellation — the receiver decodes one user’s message and subtracts it before decoding the other.

The broadcast channel is the reverse: one sender, multiple receivers. Despite being the dual of the MAC, characterizing its capacity region is much harder. For the degraded broadcast channel (where the channel outputs form a Markov chain), Cover proved that superposition coding is optimal — the sender encodes a “cloud” for the weaker receiver and a refinement for the stronger one, nested like layers of an onion.

The Slepian-Wolf theorem (1973) is a landmark result in distributed source coding. Suppose two correlated sources X and Y are encoded separately and decoded jointly (as in a sensor network where sensors cannot communicate). Slepian and Wolf proved that the minimum total rate needed is H(X, Y), the joint entropy — no more than if the encoders could cooperate. Even though neither encoder sees the other’s data, the decoder can exploit their correlation. The rates need only satisfy R_X \geq H(X \mid Y), R_Y \geq H(Y \mid X), and R_X + R_Y \geq H(X, Y) — a rate region whose boundary traces the Slepian-Wolf corner points.

Network coding, pioneered by Ahlswede, Cai, Li, and Yeung in 2000, overturns the classical routing paradigm. In traditional networks, intermediate nodes only forward packets. Network coding allows them to transmit coded combinations of received packets, dramatically increasing throughput. The butterfly network example — where two sources share a bottleneck link to reach two sinks — shows that linear network coding achieves the min-cut capacity while routing alone cannot. The capacity of a single-source multicast network equals the minimum of the min-cuts from the source to each sink, achievable with linear coding over sufficiently large finite fields.
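
The butterfly example is small enough to simulate exhaustively; the sketch below uses my own naming and treats each edge as carrying one bit per use:

```python
# Butterfly network: sources s1, s2 each hold one bit, both sinks want both
# bits, and the middle (bottleneck) link carries only one bit per use.
# Plain forwarding must choose b1 or b2; sending b1 XOR b2 serves both sinks.

def butterfly(b1, b2):
    coded = b1 ^ b2           # the bottleneck carries the XOR combination
    # Each sink also receives one bit directly over its own side edge.
    sink1 = (b1, b1 ^ coded)  # recovers b2 = b1 XOR (b1 XOR b2)
    sink2 = (coded ^ b2, b2)  # recovers b1 symmetrically
    return sink1, sink2

for b1 in (0, 1):
    for b2 in (0, 1):
        assert butterfly(b1, b2) == ((b1, b2), (b1, b2))
print("both sinks recover both bits with one bottleneck use")
```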

Kolmogorov Complexity and Algorithmic Information

Shannon entropy characterizes the average information in an ensemble — a collection of objects drawn from a known distribution. But what is the information content of a single, specific string? This question motivated Kolmogorov complexity, developed independently in the early 1960s by Andrei Kolmogorov, Ray Solomonoff, and Gregory Chaitin.

The Kolmogorov complexity K(x) of a finite binary string x is the length of the shortest program for a fixed universal Turing machine \mathcal{U} that outputs x and halts:

K(x) = \min\{ |\pi| : \mathcal{U}(\pi) = x \}.

Intuitively, K(x) measures how compressible x is — how concisely it can be described. A string of a million zeros has very low Kolmogorov complexity (the program “print ‘0’ one million times” is short). A truly random string has Kolmogorov complexity close to its own length, since no program substantially shorter than the string itself can generate it.
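
K(x) is uncomputable, but any general-purpose compressor gives a computable upper bound on description length, which makes this intuition concrete. A sketch using Python’s zlib (compressed size is a crude proxy for K(x), not the real thing):

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed form: an upper bound on description length."""
    return len(zlib.compress(data, level=9))

structured = b"0" * 1_000_000  # a million zeros: a short program describes it
random.seed(0)
incompressible = bytes(random.getrandbits(8) for _ in range(1_000_000))

print(compressed_size(structured))      # tiny: a short description suffices
print(compressed_size(incompressible))  # near 1,000,000: almost no shortcut exists
```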

The invariance theorem guarantees that the choice of universal Turing machine matters only up to a constant: if K_{\mathcal{U}} and K_{\mathcal{V}} are complexities with respect to two different universal machines, then |K_{\mathcal{U}}(x) - K_{\mathcal{V}}(x)| \leq c for some constant c independent of x. This makes Kolmogorov complexity a machine-independent concept for sufficiently long strings.

The connection between Kolmogorov complexity and Shannon entropy is deep. For a random variable X with distribution p, the expected Kolmogorov complexity satisfies \mathbb{E}[K(X)] \approx H(X) up to logarithmic additive terms. Kolmogorov complexity thereby generalizes entropy from ensembles to individual objects, providing an objective, distribution-free notion of information content.

A string x is algorithmically random if it passes all effective statistical tests for randomness — equivalently, if K(x) \geq |x| - c for a small constant c (for infinite sequences, this is Martin-Löf randomness). Most strings are random in this sense (the incompressible ones dominate), yet no specific long random string can be certified random, because the very definition is non-constructive. Chaitin’s constant \Omega = \sum_{\mathcal{U}(\pi) \text{ halts}} 2^{-|\pi|} — the halting probability of a universal machine — is the canonical example: its bits are maximally random, yet the number itself is well-defined. Knowing the first n bits of \Omega would solve the halting problem for all programs of length up to n.

The minimum description length (MDL) principle, rooted in this theory, provides a principled framework for statistical model selection: the best model for data is the one that minimizes the total description length of the model plus the data given the model. MDL formalizes Occam’s razor in information-theoretic terms and connects to Bayesian model comparison with Solomonoff’s universal prior — a probability distribution over all computable sequences weighted by their Kolmogorov complexity.

Quantum Information Theory

Classical information theory assumes that information is encoded in classical bits — physical systems that are definitively 0 or 1. Quantum information theory explores what happens when information is encoded in quantum states, governed by the laws of quantum mechanics. The field was energized by Peter Shor’s 1994 quantum factoring algorithm and has developed into a rich discipline spanning computation, communication, and cryptography.

The quantum analogue of a bit is the qubit — a two-level quantum system whose state is a unit vector |\psi\rangle = \alpha|0\rangle + \beta|1\rangle in a two-dimensional complex Hilbert space, with |\alpha|^2 + |\beta|^2 = 1. Unlike a classical bit, a qubit can be in a superposition of |0\rangle and |1\rangle, but measurement collapses it irreversibly to one of the basis states with probabilities |\alpha|^2 and |\beta|^2. Multiple qubits can be entangled — their joint state cannot be written as a product of individual states, so measuring one qubit instantly constrains the outcomes of measuring the others, regardless of their spatial separation.

The quantum analogue of Shannon entropy is the von Neumann entropy, defined for a quantum state described by a density matrix \rho:

S(\rho) = -\text{tr}(\rho \log_2 \rho) = -\sum_i \lambda_i \log_2 \lambda_i,

where \lambda_i are the eigenvalues of \rho. For a pure state \rho = |\psi\rangle\langle\psi|, one eigenvalue is 1 and the rest are zero, so S(\rho) = 0 — a pure state contains no classical uncertainty. For the maximally mixed state \rho = I/d on a d-dimensional system, S(\rho) = \log_2 d. Von Neumann entropy governs entanglement: for a bipartite pure state on AB, the entanglement entropy is S(\rho_A), where \rho_A is the reduced state of subsystem A.
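
Since S(\rho) depends only on the spectrum of \rho, it can be computed directly from a list of eigenvalues (for a diagonal density matrix these are just the diagonal entries). A minimal sketch:

```python
from math import log2

def von_neumann_entropy(eigenvalues):
    """S(rho) = -sum_i lambda_i log2 lambda_i over the spectrum of rho, in bits."""
    # Zero eigenvalues are skipped, matching the 0 log 0 = 0 convention.
    return sum(-l * log2(l) for l in eigenvalues if l > 0)

print(von_neumann_entropy([1.0, 0.0]))  # pure state -> 0.0
print(von_neumann_entropy([0.5, 0.5]))  # maximally mixed qubit -> 1.0 bit
print(von_neumann_entropy([0.25] * 4))  # maximally mixed, d = 4 -> 2.0 bits
# [0.5, 0.5] is also the spectrum of each reduced state of a Bell pair,
# so the entanglement entropy of a Bell pair is 1 bit.
```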

The quantum channel capacity is more subtle than its classical counterpart. A quantum channel \mathcal{N} (formally, a completely positive trace-preserving map) has a quantum capacity Q(\mathcal{N}) for transmitting qubits reliably, a classical capacity C(\mathcal{N}) for transmitting classical bits, and an entanglement-assisted classical capacity C_E(\mathcal{N}) = \max_\rho I(A; B)_\sigma, the quantum mutual information of the channel output state \sigma, maximized over input states. Strikingly, the entanglement-assisted capacity is given by this single-letter formula — unlike the unassisted quantum capacity, which generally requires regularization over many channel uses and remains poorly understood even for simple channels.

Quantum error correction must contend with continuous errors (any unitary perturbation is possible) and the no-cloning theorem (quantum states cannot be copied). The resolution, due to Shor (1995) and Steane (1996), is that quantum codes do not copy states but instead spread their information across entangled multi-qubit systems, allowing errors to be detected and corrected without ever learning the encoded state directly. The 9-qubit Shor code encodes 1 logical qubit into 9 physical qubits, protecting against any single-qubit error. Modern stabilizer codes and topological codes (including the toric code) achieve much higher efficiency.

Quantum key distribution (QKD), exemplified by the BB84 protocol (Bennett and Brassard, 1984), uses the laws of physics to guarantee information-theoretic security. Any eavesdropper on the quantum channel necessarily disturbs the quantum states — a fact detectable by the legitimate parties. The security of QKD is not based on computational hardness assumptions (unlike RSA) but on the fundamental constraints of quantum mechanics, making it secure even against adversaries with unlimited computational power, including quantum computers.