Information Theory
Entropy, coding theory, channel capacity, and Kolmogorov complexity.
Information theory is the mathematical science of quantifying, storing, and transmitting information. Founded in a landmark 1948 paper by Claude Shannon, it gave precise answers to questions that had seemed hopelessly vague: how much information is in a message, how efficiently data can be compressed, and how reliably signals can be sent across a noisy channel. In the decades since, the theory has reached into thermodynamics, cryptography, linguistics, machine learning, and quantum physics, making it one of the most powerful cross-disciplinary frameworks in all of applied mathematics.
Entropy and Information Measures
The central concept in information theory is Shannon entropy, a measure of the uncertainty — or, equivalently, the information content — of a probability distribution. To see why entropy is the right measure, start with a single outcome. If a source emits a symbol $x$ with probability $p(x)$, the self-information of that outcome is defined as

$$I(x) = \log_2 \frac{1}{p(x)} = -\log_2 p(x).$$
The logarithm base 2 gives the answer in bits. Outcomes that are very likely carry little surprise and therefore little information; rare outcomes carry a great deal. If $p(x) = 1$, then $I(x) = 0$ — a certain event tells you nothing. If $p(x) = 1/8$, then $I(x) = 3$ bits — learning that the outcome occurred is worth three binary questions.
For a discrete random variable $X$ taking values in an alphabet $\mathcal{X}$ with probability mass function $p(x)$, the Shannon entropy is the expected self-information:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),$$
where by convention $0 \log 0 = 0$. Entropy is non-negative, equals zero only when the outcome is certain, and achieves its maximum $\log_2 |\mathcal{X}|$ when $X$ is uniformly distributed. The binary entropy function $H_b(p) = -p \log_2 p - (1-p) \log_2(1-p)$ is the entropy of a biased coin with heads probability $p$; it peaks at $H_b(1/2) = 1$ bit and is zero at $p = 0$ or $p = 1$.
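These definitions are easy to check numerically. Below is a minimal Python sketch (the function names are my own, not from the source) computing Shannon entropy and the binary entropy function:

```python
import math

def entropy(probs):
    """Shannon entropy in bits, using the convention 0 * log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def binary_entropy(p):
    """Entropy of a biased coin with heads probability p."""
    return entropy([p, 1 - p])

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0: uniform over 4 outcomes maximizes entropy
print(binary_entropy(0.5))                 # 1.0: a fair coin carries one full bit
print(binary_entropy(1.0) == 0.0)          # True: a certain event carries no information
```

Filtering out zero-probability terms in the sum implements the $0 \log 0 = 0$ convention directly.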
The machinery of entropy extends naturally to pairs of random variables. The joint entropy of $(X, Y)$ is $H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)$. The conditional entropy $H(Y \mid X)$ measures how much uncertainty remains in $Y$ once $X$ is known:

$$H(Y \mid X) = -\sum_{x,y} p(x,y) \log_2 p(y \mid x).$$
These quantities obey the chain rule: $H(X,Y) = H(X) + H(Y \mid X)$. The total uncertainty in the pair equals the uncertainty in $X$ plus whatever uncertainty in $Y$ remains after observing $X$. This decomposes naturally for longer chains: $H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})$.
The single most important derived quantity is mutual information, which measures how much knowing $Y$ reduces uncertainty about $X$:

$$I(X;Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X,Y).$$
Mutual information is always non-negative (by the log-sum inequality) and symmetric: $I(X;Y) = I(Y;X)$. It vanishes if and only if $X$ and $Y$ are statistically independent. This symmetry is remarkable — the information $X$ carries about $Y$ is exactly the information $Y$ carries about $X$.
Also crucial is the Kullback-Leibler divergence (or relative entropy), which measures how different a distribution $P$ is from a reference $Q$:

$$D(P \,\|\, Q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}.$$
Unlike a distance, $D(P \,\|\, Q)$ is not symmetric, but it is always non-negative (Gibbs’ inequality), equaling zero only when $P = Q$. The KL divergence governs error exponents in hypothesis testing and is intimately connected to mutual information: $I(X;Y) = D\big(p(x,y) \,\|\, p(x)\,p(y)\big)$.
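A short Python sketch (helper names are my own) makes the identity between mutual information and the KL divergence from the product of marginals concrete, for a joint pmf given as a nested list:

```python
import math

def kl_divergence(p, q):
    """D(P || Q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(joint):
    """I(X;Y) from a 2-D joint pmf, computed as D(p(x,y) || p(x)p(y))."""
    px = [sum(row) for row in joint]                  # marginal of X
    py = [sum(col) for col in zip(*joint)]            # marginal of Y
    return sum(
        joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
        for i in range(len(px)) for j in range(len(py))
        if joint[i][j] > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]   # joint = product of marginals
correlated  = [[0.5, 0.0], [0.0, 0.5]]       # Y is a copy of X
print(mutual_information(independent))       # 0.0: independence means zero shared information
print(mutual_information(correlated))        # 1.0: knowing X determines Y, one full bit
```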
Shannon also connected entropy to thermodynamics. The Boltzmann-Gibbs entropy $S = -k_B \sum_i p_i \ln p_i$ from statistical mechanics has exactly the same form (with Boltzmann’s constant $k_B$ and the natural logarithm). Shannon explicitly acknowledged this parallel; indeed, the story is told that John von Neumann advised Shannon to call his quantity “entropy” since “nobody knows what entropy really is,” so Shannon would win any argument.
Source Coding and Data Compression
Shannon’s first fundamental theorem, the source coding theorem (1948), answers the question: how compactly can data from a source be encoded, in the limit? The answer is the entropy.
To make this precise, consider a discrete memoryless source emitting i.i.d. symbols from alphabet $\mathcal{X}$ with distribution $p$. The goal is to map sequences of $n$ symbols to binary strings as short as possible while still being able to recover the original sequence. The source coding theorem states that for any $\varepsilon > 0$ and all sufficiently large $n$, there exists a lossless code with average codeword length per symbol no more than $H(X) + \varepsilon$ bits, and no lossless code can achieve fewer than $H(X)$ bits per symbol.
The proof hinges on the Asymptotic Equipartition Property (AEP). By the law of large numbers applied to the random variable $-\log_2 p(X)$, the quantity $-\tfrac{1}{n} \log_2 p(X_1, \dots, X_n)$ converges in probability to $H(X)$. This means that for large $n$, all but a tiny fraction of probability is concentrated on the typical set — the collection of sequences whose probability is approximately $2^{-nH(X)}$. The typical set has only about $2^{nH(X)}$ elements, so roughly $nH(X)$ bits suffice to index them. Sequences outside the typical set are so rare that they can be handled with negligible overhead. This argument is Shannon’s elegant shortcut around the combinatorial complexity of all $|\mathcal{X}|^n$ possible sequences.
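The AEP can be observed empirically. The simulation below (my own illustration, not from the source) samples random length-n sequences and averages the per-symbol log-probability, which concentrates around the entropy H(X):

```python
import math
import random

def aep_estimate(probs, n, trials=2000, seed=0):
    """Average of -(1/n) log2 p(X_1..X_n) over random length-n i.i.d. sequences."""
    rng = random.Random(seed)
    symbols = list(range(len(probs)))
    total = 0.0
    for _ in range(trials):
        seq = rng.choices(symbols, weights=probs, k=n)
        total += -sum(math.log2(probs[s]) for s in seq) / n
    return total / trials

probs = [0.7, 0.2, 0.1]
h = -sum(p * math.log2(p) for p in probs)   # H(X), about 1.157 bits
# The empirical per-symbol log-probability concentrates around H(X) as n grows.
print(h, aep_estimate(probs, n=200))
```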
For finite-length codes, the optimal algorithm in terms of average codeword length is the Huffman code, constructed greedily by repeatedly merging the two least-probable symbols. Huffman coding achieves an average length between $H(X)$ and $H(X) + 1$ bits — always within one bit of optimal. The Kraft inequality $\sum_i 2^{-\ell_i} \le 1$ (where $\ell_i$ are the codeword lengths) is the necessary and sufficient condition for the existence of a prefix code with those lengths (and, by McMillan’s theorem, of any uniquely decodable code), and the Huffman code satisfies it with equality.
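The greedy merging step can be written compactly with a priority queue. This is a minimal Huffman sketch (not production code; the insertion counter used for tie-breaking is my own choice):

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code from {symbol: probability}; returns {symbol: bitstring}.
    Repeatedly merges the two least-probable nodes, as in the classical construction."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}     # prepend a branch bit
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())
print(code, avg_len)   # dyadic probabilities: average length equals H(X) = 1.75 bits
```

Because these probabilities are negative powers of two, the Huffman average length exactly matches the entropy; in general it falls within one bit of it.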
When near-perfect compression is needed across long sequences, arithmetic coding performs even better: it assigns a single real number in $[0, 1)$ to an entire sequence, achieving average codeword length arbitrarily close to $H(X)$ per symbol without the integer-length constraint that limits Huffman coding. The Lempel-Ziv algorithms (LZ77 and LZ78, and the derivative LZW used in the GIF format; the DEFLATE method behind ZIP builds on LZ77) achieve the same entropy limit asymptotically without knowledge of the source distribution — they learn the source on the fly.
So far, all of this is lossless compression. In lossy compression, the goal is to represent the source within an allowable distortion $D$ while minimizing the bit rate $R$. The rate-distortion function $R(D)$ gives the minimum rate required at distortion level $D$, and Shannon’s rate-distortion theorem characterizes it as

$$R(D) = \min_{p(\hat{x} \mid x) : \, \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X}).$$

JPEG image compression and MP3 audio encoding are practical realizations of this principle: they sacrifice some fidelity (high-frequency detail invisible or inaudible to humans) to achieve dramatically lower bit rates.
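For one standard case the minimization has a closed form: a Gaussian source with variance $\sigma^2$ under squared-error distortion has $R(D) = \tfrac{1}{2}\log_2(\sigma^2/D)$ for $0 < D \le \sigma^2$, and zero beyond. A quick sketch (function name is my own):

```python
import math

def gaussian_rate_distortion(variance, d):
    """R(D) for a Gaussian source under squared-error distortion:
    0.5 * log2(variance / D) for 0 < D <= variance, else 0 bits per sample."""
    if d >= variance:
        return 0.0                       # allowed distortion covers the whole source
    return 0.5 * math.log2(variance / d)

# Halving the allowed distortion costs exactly half a bit per sample.
print(gaussian_rate_distortion(1.0, 0.5))    # 0.5 bits
print(gaussian_rate_distortion(1.0, 0.25))   # 1.0 bit
```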
Channel Capacity and Noisy Channel Coding
Shannon’s second fundamental theorem is even more surprising: no matter how noisy a communication channel is, as long as it transmits any information at all, reliable communication is possible at some positive rate. The maximum such rate is the channel capacity $C$.
A discrete memoryless channel (DMC) is described by an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, and a conditional distribution $p(y \mid x)$ specifying the probability of receiving output $y$ when input $x$ is sent. Each channel use is independent of all others (hence “memoryless”). The channel capacity is:

$$C = \max_{p(x)} I(X; Y),$$

where the maximum is over all input distributions $p(x)$. Capacity is measured in bits per channel use.
For the binary symmetric channel (BSC) with crossover probability $p$ (each bit is flipped independently with probability $p$), the capacity is $C = 1 - H_b(p)$, where $H_b$ is the binary entropy function. When $p = 0$, the channel is noiseless and $C = 1$ bit. When $p = 1/2$, the output is independent of the input and $C = 0$. For the additive white Gaussian noise (AWGN) channel with noise variance $N$ and power constraint $P$, the famous Shannon-Hartley theorem gives:

$$C = \frac{1}{2} \log_2\!\left(1 + \frac{P}{N}\right) \text{ bits per channel use}.$$
This formula, often written in terms of bandwidth $W$ as $C = W \log_2(1 + S/N)$ bits per second, is the foundation of all modern wireless communication design.
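Both capacity formulas are one-liners to evaluate. A minimal sketch (function names are my own):

```python
import math

def bsc_capacity(p):
    """Capacity of the binary symmetric channel: C = 1 - H_b(p) bits per use."""
    if p in (0.0, 1.0):
        return 1.0                       # deterministic channel: no uncertainty added
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

def awgn_capacity(snr):
    """Shannon-Hartley per channel use: C = 0.5 * log2(1 + P/N) bits."""
    return 0.5 * math.log2(1.0 + snr)

print(bsc_capacity(0.0))    # 1.0: noiseless binary channel
print(bsc_capacity(0.5))    # 0.0: output independent of input
print(awgn_capacity(1.0))   # 0.5 bits per use at 0 dB SNR
```

Note that a BSC with $p = 1$ also has capacity 1: the receiver can simply invert every bit.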
The noisy-channel coding theorem states: for any rate $R < C$, there exist codes of length $n$ with $2^{nR}$ codewords such that the maximum probability of decoding error goes to zero as $n \to \infty$. Conversely, for any rate $R > C$, the error probability is bounded away from zero for any code. Shannon’s proof was non-constructive — he showed that a randomly chosen codebook works in expectation — and it took decades to find explicit codes that approach capacity.
Fano’s inequality makes the converse direction precise: if we try to communicate at rate $R > C$, then any decoder must incur a probability of error satisfying $P_e \ge 1 - \frac{C}{R} - \frac{1}{nR}$, which stays bounded away from zero (approaching $1 - C/R$) as $n$ increases. This elegant inequality links the error rate directly to the conditional entropy of the message given the received sequence, unifying the probabilistic and information-theoretic perspectives on communication limits.
Error-Correcting Codes
Shannon’s coding theorem is existential — it proves that good codes exist without building them. The field of coding theory constructs explicit codes and efficient decoding algorithms that approach the theoretical limit.
A block code of length $n$ over an alphabet of size $q$ is a set of $M$ codewords. The code rate is $R = \frac{\log_q M}{n}$ symbols per channel use. Error correction relies on the Hamming distance between codewords — the number of positions in which they differ. A code with minimum Hamming distance $d$ can detect up to $d - 1$ errors and correct up to $\lfloor (d-1)/2 \rfloor$ errors.
Linear codes over a finite field $\mathbb{F}_q$ are subspaces of $\mathbb{F}_q^n$; they are described by a generator matrix $G$ (whose rows span the code) or equivalently by a parity-check matrix $H$ (whose null space is the code). An $[n, k, d]$ linear code encodes $k$ message symbols into $n$ channel symbols with minimum distance $d$. The Hamming codes — constructed by Richard Hamming in 1950 while he was frustrated by errors on punch-card computers — are perfect linear codes: every received word is within distance 1 of exactly one codeword, so single-bit errors are always correctable and no redundancy is wasted. The $[7, 4, 3]$ Hamming code adds 3 parity bits to 4 message bits, enabling single-error correction.
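The Hamming construction fits in a few lines. Below is a sketch of the classical layout with parity bits at positions 1, 2, and 4 (one conventional choice among several); the recomputed parities form a syndrome that reads off the 1-based position of any single-bit error:

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # even parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # even parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4              # even parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Recompute parities; the syndrome is the 1-based error position (0 = no error)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1       # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]

msg = [1, 0, 1, 1]
cw = hamming74_encode(msg)
cw[4] ^= 1                          # corrupt one bit in transit
print(hamming74_decode(cw))         # recovers [1, 0, 1, 1]
```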
Reed-Solomon codes, introduced by Irving Reed and Gustave Solomon in 1960, are codes over large alphabets ($q \ge n$) whose codewords are evaluations of low-degree polynomials. An $[n, k]$ Reed-Solomon code has minimum distance $n - k + 1$ and thus achieves the Singleton bound $d \le n - k + 1$ with equality — it is a maximum distance separable (MDS) code, packing the maximum possible distance into a given rate. Reed-Solomon codes are deployed universally: they protect data on CDs and DVDs (correcting entire burst errors from scratches), safeguard data in RAID storage, and underlie QR codes. The algebraic structure that makes them powerful — polynomial interpolation over finite fields — also makes decoding tractable via the Berlekamp-Massey algorithm.
Modern coding theory shifted dramatically in the 1990s with the discovery of turbo codes (Berrou and Glavieux, 1993) and the rediscovery of low-density parity-check (LDPC) codes (originally due to Gallager in 1962, but impractical then). Both achieve rates within fractions of a decibel of the Shannon limit. LDPC codes are defined by sparse parity-check matrices represented as bipartite Tanner graphs, and they are decoded by belief propagation — an iterative message-passing algorithm on the graph. More recently, polar codes (Arikan, 2009) achieve capacity with a provably optimal successive-cancellation decoder and are used in 5G wireless standards, marking the first time a capacity-achieving code made it from theory to global deployment.
Network Information Theory
Classical information theory treats point-to-point communication: one sender, one receiver. Network information theory extends the framework to multi-terminal settings, where multiple senders and receivers share a common medium, and the results are far richer — and far less complete.
The multiple access channel (MAC) models many senders transmitting to a single receiver, as in a cellular uplink. The fundamental object is the capacity region — the set of simultaneously achievable rate pairs $(R_1, R_2)$. For the two-user Gaussian MAC, this is a pentagon characterized by the constraints $R_1 \le \frac{1}{2}\log_2(1 + P_1/N)$, $R_2 \le \frac{1}{2}\log_2(1 + P_2/N)$, and $R_1 + R_2 \le \frac{1}{2}\log_2\!\big(1 + (P_1 + P_2)/N\big)$. The sum-rate constraint shows that the two users compete for the same channel resource, but the corner points of the capacity region are achieved by successive interference cancellation — the receiver decodes one user’s message and subtracts it before decoding the other.
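Evaluating the three constraints shows the competition directly: the sum-rate bound is strictly less than the sum of the individual bounds. A small sketch (function name and parameter values are my own):

```python
import math

def gaussian_mac_region(p1, p2, n):
    """The three constraints of the two-user Gaussian MAC capacity region (bits per use)."""
    c = lambda snr: 0.5 * math.log2(1.0 + snr)
    return {"R1": c(p1 / n), "R2": c(p2 / n), "sum": c((p1 + p2) / n)}

region = gaussian_mac_region(p1=3.0, p2=3.0, n=1.0)
# Each user alone could send 1 bit/use, but they cannot both do so simultaneously:
# the sum-rate bound 0.5*log2(7) is about 1.4 bits, well below 2.
print(region)
```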
The broadcast channel is the reverse: one sender, multiple receivers. Despite being the dual of the MAC, characterizing its capacity region is much harder. For the degraded broadcast channel (where the channel outputs form a Markov chain), Cover proved that superposition coding is optimal — the sender encodes a “cloud” for the weaker receiver and a refinement for the stronger one, nested like layers of an onion.
The Slepian-Wolf theorem (1973) is a landmark result in distributed source coding. Suppose two correlated sources $X$ and $Y$ are encoded separately and decoded jointly (as in a sensor network where sensors cannot communicate). Slepian and Wolf proved that the minimum total rate needed is $H(X, Y)$, the joint entropy — no more than if the encoders could cooperate. Even though neither encoder sees the other’s data, the decoder can exploit correlations. The rates need only satisfy $R_X \ge H(X \mid Y)$, $R_Y \ge H(Y \mid X)$, and $R_X + R_Y \ge H(X, Y)$ — a rate region whose boundary traces the Slepian-Wolf corner.
Network coding, pioneered by Ahlswede, Cai, Li, and Yeung in 2000, overturns the classical routing paradigm. In traditional networks, intermediate nodes only forward packets. Network coding allows them to transmit coded combinations of received packets, dramatically increasing throughput. The butterfly network example — where two sources share a bottleneck link to reach two sinks — shows that linear network coding achieves the min-cut capacity while routing alone cannot. The capacity of a single-source multicast network equals the minimum of the min-cuts from the source to each sink, achievable with linear coding over sufficiently large finite fields.
Kolmogorov Complexity and Algorithmic Information
Shannon entropy characterizes the average information in an ensemble — a collection of objects drawn from a known distribution. But what is the information content of a single, specific string? This question motivated Kolmogorov complexity, developed independently in the early 1960s by Andrei Kolmogorov, Ray Solomonoff, and Gregory Chaitin.
The Kolmogorov complexity $K(x)$ of a finite binary string $x$ is the length of the shortest program $p$ for a fixed universal Turing machine $U$ that outputs $x$ and halts:

$$K(x) = \min \{\, |p| : U(p) = x \,\}.$$
Intuitively, $K(x)$ measures how compressible $x$ is — how concisely it can be described. A string of a million zeros has very low Kolmogorov complexity (the program “print ‘0’ one million times” is short). A truly random string of length $n$ has Kolmogorov complexity close to $n$, since no program substantially shorter than the string itself can generate it.
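Kolmogorov complexity itself is uncomputable, but any practical compressor gives a computable upper bound on description length, which makes the intuition testable. A sketch using zlib as a crude proxy (my own illustration, not a claim from the source):

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of a zlib-compressed encoding: an upper bound on description length.
    (True Kolmogorov complexity is uncomputable; this is only a rough proxy.)"""
    return len(zlib.compress(data, level=9))

structured = b"0" * 1_000_000                  # a million zeros: highly compressible
random.seed(0)
incompressible = bytes(random.getrandbits(8) for _ in range(1_000_000))  # no exploitable pattern

print(compressed_size(structured))      # tiny: a short description suffices
print(compressed_size(incompressible))  # close to 1,000,000: essentially incompressible
```

The pseudorandom bytes are, strictly speaking, generated by a short program plus a seed, so their true Kolmogorov complexity is low; zlib simply cannot find that description, which is exactly the gap between compressibility in principle and in practice.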
The invariance theorem guarantees that the choice of universal Turing machine matters only up to a constant: if $K_U$ and $K_V$ are complexities with respect to two different universal machines, then $|K_U(x) - K_V(x)| \le c_{UV}$ for some constant $c_{UV}$ independent of $x$. This makes Kolmogorov complexity a machine-independent concept for sufficiently long strings.
The connection between Kolmogorov complexity and Shannon entropy is deep. For a random variable $X$ with distribution $p$, the expected Kolmogorov complexity satisfies $\mathbb{E}[K(X)] = H(X)$ up to logarithmic additive terms. Kolmogorov complexity thereby generalizes entropy from ensembles to individual objects, providing an objective, distribution-free notion of information content.
A string $x$ is Martin-Löf random if it passes all effective statistical tests for randomness — equivalently, if $K(x) \ge |x| - c$ for a constant $c$. Most strings are random in this sense (the incompressible ones dominate), yet no specific random string can be certified random, because the very definition is non-constructive. Chaitin’s constant $\Omega$ — the halting probability of a universal machine — is the canonical example: its bits are maximally random, yet the number itself is well-defined. Knowing finitely many bits of $\Omega$ would solve the halting problem for programs up to the corresponding length.
The minimum description length (MDL) principle, rooted in this theory, provides a principled framework for statistical model selection: the best model for data is the one that minimizes the total description length of the model plus the data given the model. MDL formalizes Occam’s razor in information-theoretic terms and connects to Bayesian model comparison with Solomonoff’s universal prior — a probability distribution over all computable sequences weighted by their Kolmogorov complexity.
Quantum Information Theory
Classical information theory assumes that information is encoded in classical bits — physical systems that are definitively 0 or 1. Quantum information theory explores what happens when information is encoded in quantum states, governed by the laws of quantum mechanics. The field was energized by Peter Shor’s 1994 quantum factoring algorithm and has developed into a rich discipline spanning computation, communication, and cryptography.
The quantum analogue of a bit is the qubit — a two-level quantum system whose state is a unit vector $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ in a two-dimensional complex Hilbert space, with $|\alpha|^2 + |\beta|^2 = 1$. Unlike a classical bit, a qubit can be in a superposition of $|0\rangle$ and $|1\rangle$, but measurement collapses it irreversibly to one of the basis states with probabilities $|\alpha|^2$ and $|\beta|^2$. Multiple qubits can be entangled — their joint state cannot be written as a product of individual states, so measuring one qubit instantly constrains the outcomes of measuring the others, regardless of their spatial separation.
The quantum analogue of Shannon entropy is the von Neumann entropy, defined for a quantum state described by density matrix $\rho$:

$$S(\rho) = -\operatorname{Tr}(\rho \log_2 \rho) = -\sum_i \lambda_i \log_2 \lambda_i,$$
where $\lambda_i$ are the eigenvalues of $\rho$. For a pure state $\rho = |\psi\rangle\langle\psi|$, all eigenvalues except one are zero, and $S(\rho) = 0$ — a pure state contains no classical uncertainty. For the maximally mixed state on a $d$-dimensional system, $S = \log_2 d$. Von Neumann entropy governs entanglement: for a bipartite pure state $|\psi\rangle$ on $\mathcal{H}_A \otimes \mathcal{H}_B$, the entanglement entropy is $S(\rho_A)$, where $\rho_A = \operatorname{Tr}_B(|\psi\rangle\langle\psi|)$ is the reduced state of subsystem $A$.
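Since the von Neumann entropy is a function of eigenvalues, it is straightforward to compute numerically. A NumPy sketch (function name is my own) that also checks the entanglement entropy of a Bell state via a partial trace:

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -sum_i lambda_i log2 lambda_i over the eigenvalues of rho."""
    eigenvalues = np.linalg.eigvalsh(rho)            # rho is Hermitian
    eigenvalues = eigenvalues[eigenvalues > 1e-12]   # 0 log 0 = 0 by convention
    return float(-np.sum(eigenvalues * np.log2(eigenvalues)))

pure = np.array([[1.0, 0.0], [0.0, 0.0]])            # |0><0|: entropy 0
mixed = np.eye(2) / 2                                 # maximally mixed qubit: entropy 1 bit

# Entanglement entropy of the Bell state (|00> + |11>)/sqrt(2):
# trace out subsystem B and take the entropy of the reduced state.
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho_ab = np.outer(bell, bell)
rho_a = rho_ab.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)   # partial trace over B
print(von_neumann_entropy(pure), von_neumann_entropy(mixed), von_neumann_entropy(rho_a))
```

The reduced state of either half of a Bell state is maximally mixed, so its entanglement entropy is one full bit — the maximum for a qubit.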
The quantum channel capacity is more subtle than its classical counterpart. A quantum channel $\mathcal{N}$ (formally, a completely positive trace-preserving map) has a quantum capacity $Q(\mathcal{N})$ for transmitting qubits reliably, a classical capacity $C(\mathcal{N})$ for transmitting classical bits, and an entanglement-assisted classical capacity $C_E(\mathcal{N})$, given by the quantum mutual information between input and output, maximized over input states. Strikingly, the entanglement-assisted capacity achieves a single-letter formula — unlike the unassisted quantum capacity, which generally requires regularization over many channel uses and remains poorly understood for even simple channels.
Quantum error correction must contend with continuous errors (any unitary perturbation is possible) and the no-cloning theorem (quantum states cannot be copied). The resolution, due to Shor (1995) and Steane (1996), is that quantum codes do not copy states but instead spread their information across entangled multi-qubit systems, allowing errors to be detected and corrected without ever learning the encoded state directly. The 9-qubit Shor code encodes 1 logical qubit into 9 physical qubits, protecting against any single-qubit error. Modern stabilizer codes and topological codes (including the toric code) achieve much higher efficiency.
Quantum key distribution (QKD), exemplified by the BB84 protocol (Bennett and Brassard, 1984), uses the laws of physics to guarantee information-theoretic security. Any eavesdropper on the quantum channel necessarily disturbs the quantum states — a fact detectable by the legitimate parties. The security of QKD is not based on computational hardness assumptions (unlike RSA) but on the fundamental constraints of quantum mechanics, making it secure even against adversaries with unlimited computational power, including quantum computers.