Information Theory
The mathematical study of information, entropy, and communication — from Shannon's foundational work to modern coding theory.
Information theory is the mathematical framework for quantifying, storing, and communicating information. It answers questions that seem almost philosophical — what is information? how much of it does a message contain? what are the absolute limits of data compression and reliable communication? — with precise, provable results. The field was single-handedly created by Claude Shannon in his 1948 paper A Mathematical Theory of Communication, one of the most influential scientific papers of the twentieth century, and its ideas now permeate everything from wireless communications and streaming video to machine learning and cryptography.
Entropy and the Measure of Information
The central insight of information theory is that information can be measured, and the right way to measure it is through surprise. An event that is highly likely carries little information when it occurs; an event that is highly unlikely carries a great deal. Shannon formalized this intuition with a quantity called self-information: for an event with probability , the self-information is , measured in bits. An event with probability carries exactly one bit; an event with probability carries three bits.
Shannon wanted a measure for an entire random source — one that captures the average information produced per symbol. This measure is Shannon entropy, defined for a discrete random variable with possible values as:
Entropy is the expected value of self-information, averaged over all possible outcomes weighted by their probabilities. It reaches its maximum value of when all outcomes are equally likely — the state of maximum uncertainty — and its minimum value of zero when one outcome is certain and all others impossible. For a binary random variable with probability of being 1 and of being 0, the entropy is the binary entropy function , which peaks at with a value of one bit.
Shannon showed in 1948 that entropy is the only measure satisfying three natural axioms: continuity in the probabilities, maximality for the uniform distribution, and additivity for decomposed choices. The name itself was suggested by John von Neumann, who reportedly told Shannon to call it entropy because “nobody knows what entropy really is, so in a debate you will always have the advantage.”
The concept extends to multiple variables. The joint entropy measures the total uncertainty in two variables taken together, while the conditional entropy measures how much uncertainty remains about once is known. These obey a chain rule: — the total uncertainty equals that of plus the residual uncertainty of given .
Mutual Information and Divergence
If entropy measures how much we do not know, mutual information measures how much knowing one thing tells us about another. For two random variables and , the mutual information is:
This quantity is symmetric — tells us as much about as tells us about — and non-negative: observing a variable can never increase our uncertainty about another. Mutual information is zero if and only if and are independent, and it equals the entropy of either variable when the other completely determines it. It can be visualized as the overlap in a Venn diagram of entropies, where and are two circles and is their intersection.
Closely related to mutual information is the Kullback-Leibler divergence (also called relative entropy), introduced by Solomon Kullback and Richard Leibler in 1951. For two probability distributions and over the same set of outcomes, the KL divergence is:
This measures the expected number of extra bits needed to encode data from using a code optimized for . The KL divergence is always non-negative (Gibbs’ inequality), equals zero if and only if , but is not symmetric and therefore not a true metric. Despite this asymmetry, it is central to statistics, machine learning, and Bayesian inference. Mutual information is simply the KL divergence between the joint distribution and the product of marginals . The Jensen-Shannon divergence remedies the asymmetry by averaging: where , producing a symmetric, bounded quantity whose square root is a true metric.
One of the most elegant results in information theory is the data processing inequality: if random variables form a Markov chain (meaning depends on only through ), then . Processing data can only destroy information, never create it. This principle has profound consequences: no clever algorithm applied to the output of a noisy channel can recover more information about the input than the channel itself transmits.
Source Coding and Data Compression
Shannon’s first major theorem, the source coding theorem (also called the noiseless coding theorem), establishes that entropy is the fundamental limit of lossless data compression. More precisely, the theorem states that a source producing symbols with entropy bits per symbol cannot be compressed to fewer than bits per symbol on average, and that compression to a rate arbitrarily close to is achievable. Entropy, in this sense, is the irreducible information content of the source.
The practical consequence is immediate: to compress data efficiently, one must assign short codes to likely symbols and long codes to unlikely ones. A prefix code (or prefix-free code) is a code in which no codeword is a prefix of any other, guaranteeing unique decodability. The Kraft inequality constrains the codeword lengths of any prefix code:
David Huffman discovered in 1952, as a student in Robert Fano’s class at MIT, an algorithm for constructing optimal prefix codes. Huffman coding builds a binary tree from the bottom up, repeatedly merging the two least-probable symbols, and produces a code whose expected length is within one bit of the entropy. The story goes that Fano had assigned the problem as a term paper alternative to a final exam, and Huffman’s elegant solution surpassed the method that Fano himself had devised. Huffman coding remains widely used and is a component of formats like JPEG and ZIP.
Arithmetic coding, developed in the 1970s and 1980s, encodes an entire message as a single number in the interval , successively narrowing the interval based on each symbol’s probability. It achieves rates closer to entropy than Huffman coding, especially for highly skewed distributions.
Beyond symbol-by-symbol coding, dictionary-based methods exploit higher-order structure. The Lempel-Ziv family of algorithms — LZ77, LZ78, and their descendants like LZW — replace repeated patterns with references to earlier occurrences. Abraham Lempel and Jacob Ziv proved in 1977-78 that their algorithms are universal: they asymptotically achieve the entropy rate of any stationary ergodic source, without knowing the source statistics in advance. This universality is why LZ-based algorithms power nearly every general-purpose compression utility, from gzip to PNG.
Channel Capacity and Reliable Communication
Shannon’s second great theorem addresses a different problem: communication over a noisy channel. Noise corrupts transmitted symbols, introducing errors. The natural expectation would be that the only way to reduce errors is to slow down transmission — to add so much redundancy that errors become rare. Shannon proved something far more surprising.
A discrete memoryless channel is described by a set of input symbols, a set of output symbols, and a conditional probability distribution specifying the probability of receiving when is sent. The channel capacity is the maximum mutual information between input and output, optimized over all possible input distributions:
For a binary symmetric channel (BSC) that flips each bit independently with probability , the capacity is bits per use, where is the binary entropy function. For a binary erasure channel (BEC) that either transmits a bit correctly or replaces it with an erasure symbol with probability , the capacity is simply .
Shannon’s noisy channel coding theorem (1948) states: for any rate , there exist codes that allow communication at rate with error probability approaching zero as block length grows. Conversely, at any rate , reliable communication is impossible. The theorem is non-constructive — Shannon’s proof uses random coding arguments — and finding practical codes that approach capacity became one of the great engineering challenges of the following decades.
Fano’s inequality provides the converse direction, lower-bounding error probability in terms of conditional entropy. The Blahut-Arimoto algorithm (Blahut and Arimoto, 1972) gives an iterative method for computing capacity numerically. For Gaussian channels with additive white Gaussian noise, the capacity takes the celebrated form , where SNR is the signal-to-noise ratio — the Shannon-Hartley theorem.
Channel Coding and Error Correction
The search for practical codes that approach Shannon’s capacity limit has driven decades of research in coding theory. The simplest strategy is a repetition code, which transmits each bit multiple times and takes a majority vote at the receiver. This reduces errors but is extremely wasteful — to achieve low error rates, one must repeat each bit many times, making the code rate very low.
Linear block codes provide a far more structured approach. A binary linear code encodes -bit messages into -bit codewords, and the set of all codewords forms a subspace of , specified by a generator matrix or a parity-check matrix . The Hamming distance between codewords — the number of differing positions — determines error-correcting ability: a code with minimum distance corrects up to errors. Richard Hamming introduced the first single-error-correcting codes in 1950, motivated by the frequent errors in early relay-based computers at Bell Labs.
Convolutional codes, introduced by Peter Elias in 1955, process data as a continuous stream rather than in fixed-size blocks. The encoder has memory and its output depends on the current input and several previous inputs. Decoding is performed using the Viterbi algorithm (Andrew Viterbi, 1967), which finds the most likely transmitted sequence by searching a trellis diagram. Convolutional codes dominated practical applications for decades and were used in the Voyager space missions and the GSM cellular standard.
The modern era of coding theory began with two breakthroughs. Turbo codes, introduced by Claude Berrou, Alain Glavieux, and Punya Thitimajshima in 1993, concatenate two convolutional codes with an interleaver and use iterative decoding to achieve performance within a fraction of a decibel of Shannon capacity. Almost simultaneously, the rediscovery of low-density parity-check (LDPC) codes — originally invented by Robert Gallager in his 1960 doctoral thesis but largely forgotten — showed that these sparse-graph codes, decoded by belief propagation on factor graphs, also approach capacity. Most recently, polar codes, invented by Erdal Arikan in 2009, are the first provably capacity-achieving codes with efficient encoding and decoding, based on the phenomenon of channel polarization: through recursive transformations, synthetic channels are created that are either nearly perfect or nearly useless, and data is transmitted only on the good channels. Polar codes are now part of the 5G wireless standard.
Rate-Distortion Theory and Lossy Compression
Lossless compression preserves data exactly, but many applications — images, audio, video — tolerate some loss of fidelity in exchange for dramatically higher compression. Rate-distortion theory, also developed by Shannon, provides the mathematical framework for this tradeoff.
A distortion measure quantifies the cost of reproducing source symbol as . Common choices include squared error for continuous sources and Hamming distance for discrete sources. The rate-distortion function specifies the minimum number of bits per symbol needed to represent the source with average distortion at most :
For a Gaussian source with variance under squared-error distortion, the rate-distortion function takes the elegant form for , and zero for . This means that reducing the distortion by half requires one additional bit per symbol — a precise quantification of the quality-versus-size tradeoff.
Quantization is the practical mechanism for lossy compression. A scalar quantizer maps each input value to one of a finite set of reconstruction points. The Lloyd-Max algorithm iteratively refines these points to minimize mean squared error. Vector quantization extends this to blocks of samples, exploiting correlations — rate-distortion theory guarantees it can always match or beat scalar quantization.
Modern lossy compression standards — JPEG for images, MP3 and AAC for audio, H.264 and H.265 for video — combine transform coding, quantization, and entropy coding. While they do not achieve the rate-distortion bound exactly, their design is guided by it. Perceptual coding allocates bits to perceptually important features and discards information that human senses cannot detect.
Kolmogorov Complexity and Algorithmic Information
While Shannon entropy measures the information content of a source — a probability distribution over messages — there is an alternative perspective that measures the information content of an individual string, without reference to any probability model. This is Kolmogorov complexity, introduced independently by Ray Solomonoff (1960), Andrey Kolmogorov (1965), and Gregory Chaitin (1966).
The Kolmogorov complexity of a string is the length of the shortest program that produces on a fixed universal Turing machine. A highly regular string like has low complexity — a short program generates it — while a string produced by fair coin flips almost certainly has complexity close to its own length. A string is incompressible (or algorithmically random) if , and a counting argument shows that most strings are incompressible.
Kolmogorov complexity is uncomputable: no algorithm can determine for arbitrary , by a diagonalization argument reminiscent of the halting problem. Despite this, it is a powerful theoretical tool. The connection to Shannon entropy is deep: for a random variable with distribution , is close to , up to a constant depending only on the choice of universal machine.
Kolmogorov complexity provides a foundation for algorithmic information theory, which reinterprets probability, statistics, and inductive inference in computational terms. The principle of Occam’s razor — prefer simpler explanations — receives a formal justification through Solomonoff’s universal prior, which assigns to each string a probability exponentially decreasing in its Kolmogorov complexity. This connection between information, computation, and learning continues to influence modern machine learning, where the minimum description length principle uses compression as a guide for model selection.
Information-Theoretic Security and Beyond
Information theory provides a rigorous foundation for reasoning about the security of cryptographic systems, independent of any assumptions about computational hardness. Shannon himself laid this foundation in his 1949 paper Communication Theory of Secrecy Systems, which was originally a classified Bell Labs report from 1945.
Perfect secrecy requires that observing the ciphertext gives an adversary no information about the plaintext: . Shannon proved that this demands a key at least as long as the message, making the one-time pad (the Vernam cipher, patented by Gilbert Vernam in 1917) essentially the only perfectly secret cipher. Encryption is bitwise XOR with a uniformly random key; used once it is unbreakable, reused it is fatally insecure.
This impossibility result motivated computational security, where breaking a cipher requires infeasible computation even if it is theoretically breakable. The wiretap channel, introduced by Aaron Wyner in 1975, models eavesdropping on a degraded version of the transmitted signal. Wyner showed that positive secrecy capacity exists when the eavesdropper’s channel is noisier than the legitimate receiver’s — enabling secure communication without any shared secret key.
Beyond point-to-point communication, network information theory extends the framework to multi-user settings. The Slepian-Wolf theorem (1973) showed that two correlated sources can be compressed separately but decompressed jointly, achieving the same total rate as if they were compressed together — a counterintuitive result with applications in sensor networks and distributed storage. For sources producing sequences over time, the entropy rate captures the per-symbol entropy in the limit of long sequences, and the asymptotic equipartition property (AEP) — the information-theoretic law of large numbers — partitions all possible long sequences into a small typical set of roughly equally likely sequences and a vast atypical set that is almost never observed. The AEP is the engine behind almost every asymptotic result in the field.
Information theory continues to expand. In machine learning, mutual information and KL divergence appear in variational autoencoders, the information bottleneck method, and analyses of generalization. Quantum information theory extends Shannon’s framework with von Neumann entropy, qubits, and entanglement as a communication resource, while the BB84 protocol (Charles Bennett and Gilles Brassard, 1984) achieves provably secure key exchange using quantum mechanics. And the deep connections between information and physics — embodied in Landauer’s principle, which states that erasing one bit of information dissipates at least joules of heat — suggest that information is not merely an abstract mathematical concept but a fundamental physical quantity, as real as energy or matter.