Proof Systems

Formalizing deduction — natural deduction, sequent calculus, and the relationship between provability and truth.


A proof system is a purely syntactic machine for deriving theorems: given a set of axioms and inference rules, it produces new formulas without ever appealing to truth or meaning. Where semantics asks “is this sentence true in every model?”, a proof system asks “can this sentence be derived from the rules?” The remarkable fact — established by completeness theorems — is that these two questions have the same answer for classical logic. But the syntactic perspective stands on its own, and different proof systems illuminate different aspects of deduction. Natural deduction mirrors how mathematicians actually reason; Hilbert-style systems strip inference down to a single rule; sequent calculus exposes the deep symmetries of logical connectives. Understanding why these systems exist, and how they relate, is the starting point for proof theory, automated reasoning, and the foundations of mathematics.

Natural Deduction

In 1934, Gerhard Gentzen introduced natural deduction with the explicit goal of capturing how mathematicians actually think. Rather than deriving everything from a fixed set of axioms, natural deduction provides paired introduction and elimination rules for each logical connective — rules that say how to build a formula up and how to take it apart. The system was a departure from earlier formalisms: proofs look like trees of reasoning steps, not chains of axiom substitutions, and the rules are designed so that each connective is fully characterized by the way it is used.

The introduction and elimination rules for propositional connectives are as follows. For conjunction: $\land$-intro derives $\varphi \land \psi$ from separate proofs of $\varphi$ and $\psi$; $\land$-elim extracts either conjunct from a proof of $\varphi \land \psi$. For disjunction: $\lor$-intro derives $\varphi \lor \psi$ from a proof of either disjunct; $\lor$-elim (sometimes called “proof by cases”) derives $\chi$ from $\varphi \lor \psi$ together with proofs that $\varphi \to \chi$ and $\psi \to \chi$. For implication: $\to$-intro derives $\varphi \to \psi$ by assuming $\varphi$ and deriving $\psi$, then discharging the assumption; $\to$-elim is the familiar modus ponens, deriving $\psi$ from $\varphi$ and $\varphi \to \psi$. For negation: $\neg$-intro derives $\neg \varphi$ by assuming $\varphi$ and deriving a contradiction ($\bot$); $\neg$-elim derives $\bot$ from $\varphi$ and $\neg \varphi$; and the rule ex falso quodlibet ($\bot$-elim) derives any formula from $\bot$.
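Under the proofs-as-terms reading that Gentzen's rules later inspired, each rule corresponds to a term constructor. As a small sketch, the core library of Lean 4 exposes these rules directly (`And.intro`, `And.left`, and `Or.elim` are Lean's names for $\land$-intro, $\land$-elim, and $\lor$-elim):

```lean
-- Intro/elim rules as Lean 4 term constructions (core library only).
example (A B : Prop) (ha : A) (hb : B) : A ∧ B := And.intro ha hb   -- ∧-intro
example (A B : Prop) (h : A ∧ B) : A := h.left                      -- ∧-elim
example (A B C : Prop) (h : A ∨ B) (f : A → C) (g : B → C) : C :=
  Or.elim h f g                                                     -- ∨-elim (proof by cases)
```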

Classical logic adds a further rule: reductio ad absurdum (proof by contradiction). To prove $\varphi$, assume $\neg \varphi$ and derive $\bot$; then discharge the assumption and conclude $\varphi$. This rule is what separates classical natural deduction from its intuitionistic counterpart, which omits it and thereby restricts itself to constructive reasoning. Beyond the logical rules, natural deduction also employs structural rules that govern the management of assumptions: weakening allows unused assumptions to be carried along in a proof, contraction permits using the same assumption more than once, and exchange says the order in which assumptions are listed does not matter.

The system extends naturally to first-order logic by adding quantifier rules. Universal introduction ($\forall$-intro) derives $\forall x\, \varphi(x)$ from a proof of $\varphi(a)$ where $a$ is an arbitrary parameter not appearing in any undischarged assumption — this captures the “let $a$ be arbitrary” move in mathematical proofs. Universal elimination ($\forall$-elim) derives $\varphi(t)$ from $\forall x\, \varphi(x)$ for any term $t$. Existential introduction ($\exists$-intro) derives $\exists x\, \varphi(x)$ from $\varphi(t)$ for a specific witness $t$. Existential elimination ($\exists$-elim) derives $\chi$ from $\exists x\, \varphi(x)$ and a subproof showing that $\chi$ follows from $\varphi(a)$, where $a$ is a fresh parameter not occurring in $\chi$ or in any undischarged assumption.
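The quantifier rules admit the same term-level reading. A brief Lean 4 sketch, with an assumed predicate `P : Nat → Prop` for illustration:

```lean
-- ∀-elim is function application; ∃-intro packages a witness with a proof;
-- ∀-intro is a lambda over an arbitrary parameter.
example (P : Nat → Prop) (h : ∀ x, P x) : P 3 := h 3           -- ∀-elim
example (P : Nat → Prop) (h : P 5) : ∃ x, P x := ⟨5, h⟩         -- ∃-intro
example (P Q : Nat → Prop) (h : ∀ x, P x ∧ Q x) : ∀ x, P x :=
  fun a => (h a).left                                           -- ∀-intro
```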

As an example, consider deriving $A \to (B \to A)$. We assume $A$ (assumption 1), then assume $B$ (assumption 2). From assumption 1 we have $A$ immediately. By $\to$-intro, discharging assumption 2, we get $B \to A$. By $\to$-intro again, discharging assumption 1, we get $A \to (B \to A)$. The proof is a nested discharge of assumptions — exactly the pattern a working mathematician would use when told “suppose $A$; now suppose also $B$; clearly $A$ still holds, so $B \to A$.”

The key insight of natural deduction is this correspondence between formal derivations and informal mathematical reasoning. Gentzen designed the system so that a proof of $\varphi \to \psi$ is literally a procedure that transforms a proof of $\varphi$ into a proof of $\psi$. This idea — that proofs have computational content — later became the foundation of the Curry-Howard correspondence, linking proof theory to type theory and lambda calculus.
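For instance, under this correspondence the derivation of $A \to (B \to A)$ above is just the constant-function term: the two nested assumption discharges become two nested lambdas. A Lean 4 sketch:

```lean
-- Assumption 1 is the bound variable `ha`; assumption 2 is bound but
-- never used, hence the `_`.
example (A B : Prop) : A → (B → A) :=
  fun (ha : A) => fun (_ : B) => ha
```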

Hilbert-Style Axiom Systems

Hilbert-style systems take the opposite approach to natural deduction: instead of many inference rules and no axioms, they use many axiom schemas and a single inference rule. The design reflects David Hilbert’s formalist program of the early 1900s, which sought to reduce all of mathematics to a minimal, mechanically verifiable foundation. The result is a system of extreme economy — easy to reason about but painful to reason in.

A standard axiomatization of classical propositional logic uses three axiom schemas and one rule. The axioms are:

$$\text{(A1)}\quad \varphi \to (\psi \to \varphi)$$

$$\text{(A2)}\quad (\varphi \to (\psi \to \chi)) \to ((\varphi \to \psi) \to (\varphi \to \chi))$$

$$\text{(A3)}\quad (\neg \varphi \to \neg \psi) \to (\psi \to \varphi)$$

The sole inference rule is modus ponens: from $\varphi$ and $\varphi \to \psi$, derive $\psi$. Every instance of these schemas — obtained by substituting any formulas for $\varphi$, $\psi$, $\chi$ — is an axiom. Axiom A1 says that a proposition, once established, remains derivable under any additional assumption; A2 distributes implication over itself; A3 encodes the contrapositive, which gives the system its classical character.
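To make the mechanics concrete, here is a minimal Python sketch of Hilbert-style derivation, with formulas as nested tuples. The helper names (`imp`, `a1`, `a2`, `modus_ponens`) are illustrative, not from any library; the derivation shown is the classic five-step proof of $p \to p$ from A1 and A2 alone.

```python
# A minimal Hilbert-system derivation sketch: formulas are nested tuples,
# axiom schemas are functions that build instances, and modus ponens
# checks its premises structurally before producing the conclusion.

def imp(a, b):            # build the formula a -> b
    return ('->', a, b)

def a1(p, q):             # schema A1: p -> (q -> p)
    return imp(p, imp(q, p))

def a2(p, q, r):          # schema A2: (p->(q->r)) -> ((p->q)->(p->r))
    return imp(imp(p, imp(q, r)), imp(imp(p, q), imp(p, r)))

def modus_ponens(premise, implication):
    # from p and p -> q, derive q (refuse anything else)
    assert implication[0] == '->' and implication[1] == premise
    return implication[2]

# The five-step derivation of p -> p.
p = 'p'
s1 = a2(p, imp(p, p), p)   # (p->((p->p)->p)) -> ((p->(p->p))->(p->p))
s2 = a1(p, imp(p, p))      # p -> ((p->p)->p)
s3 = modus_ponens(s2, s1)  # (p->(p->p)) -> (p->p)
s4 = a1(p, p)              # p -> (p->p)
s5 = modus_ponens(s4, s3)  # p -> p
assert s5 == imp(p, p)
```

Because `modus_ponens` checks its premises, any misstep in the derivation raises an assertion error; the proof succeeds only if each line really follows from earlier ones.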

The central metatheorem of Hilbert-style systems is the deduction theorem: if $\Gamma \cup \{\varphi\} \vdash \psi$ (i.e., $\psi$ is derivable from $\Gamma$ together with $\varphi$), then $\Gamma \vdash \varphi \to \psi$. This is not a rule of the system but a theorem about the system, and its proof is a constructive procedure that transforms any derivation using an extra assumption $\varphi$ into a longer derivation that avoids the assumption by wrapping each step under $\varphi \to (\cdots)$. The deduction theorem bridges the gap between Hilbert-style proofs and the more intuitive assumption-based reasoning of natural deduction.

This axiomatization is complete for classical propositional logic: every tautology is derivable. The proof of completeness, combined with the soundness of the axioms and of modus ponens, shows that provability in the Hilbert system coincides exactly with semantic validity. For first-order logic, the axiom set is extended with quantifier axioms — such as $\forall x\, \varphi(x) \to \varphi(t)$ (universal instantiation, for any term $t$ free for $x$ in $\varphi$) — and a generalization rule that derives $\forall x\, \varphi(x)$ from $\varphi(x)$ when $x$ does not occur free in any hypothesis used.

The trade-off of Hilbert systems is one of economy versus usability. The system has a beautifully small set of primitives, which makes it ideal for metamathematical analysis — proofs of completeness, consistency, and independence are cleanest in this setting. But actual derivations are notoriously long. Proving something as elementary as $\varphi \to \varphi$ (the identity) requires a five-step derivation using A1, A2, and modus ponens, whereas in natural deduction it is a one-step assumption discharge. The verbosity is real but bounded: the constructive proof of the deduction theorem translates assumption-based reasoning into Hilbert-style derivations with only polynomial blowup, so the cost is in readability rather than asymptotic proof length. In practice, logicians typically work in richer systems and invoke the equivalence to Hilbert systems only when they need the metamathematical leverage that minimality provides.

Sequent Calculus and Cut Elimination

Gentzen introduced the sequent calculus alongside natural deduction in his landmark 1934 paper, but it was designed for a different purpose: not to mirror mathematical reasoning, but to make the structure of proofs amenable to mathematical analysis. The fundamental object is the sequent, written $\Gamma \vdash \Delta$, where $\Gamma$ and $\Delta$ are finite multisets of formulas. The intended reading is “if all formulas in $\Gamma$ hold, then at least one formula in $\Delta$ holds.” The two-sided presentation — with formulas on both left and right of the turnstile — reveals a duality between assumptions and conclusions that is invisible in other proof systems.

Each connective receives left rules and right rules. For instance, the right rule for conjunction ($\land$-R) derives $\Gamma \vdash \Delta, \varphi \land \psi$ from $\Gamma \vdash \Delta, \varphi$ and $\Gamma \vdash \Delta, \psi$ — to conclude a conjunction on the right, prove both conjuncts. The left rule for conjunction ($\land$-L) derives $\Gamma, \varphi \land \psi \vdash \Delta$ from $\Gamma, \varphi \vdash \Delta$ (or from $\Gamma, \psi \vdash \Delta$) — to use a conjunction on the left, extract a conjunct. Similarly, $\to$-R introduces an implication on the right by moving the antecedent to the left of the turnstile, and $\to$-L splits: to use $\varphi \to \psi$ on the left, one must prove $\varphi$ (on the right) and use $\psi$ (on the left). The structural rules — weakening, contraction, and exchange — operate on both sides of the sequent.
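These rules drive backward proof search: read bottom-up, each rule replaces a sequent by simpler premises. The following Python sketch implements an invertible, G3-style variant of the cut-free propositional rules (the left conjunction rule keeps both conjuncts, so no backtracking over rule choice is needed); `provable` and the tuple encoding are illustrative, not any standard library.

```python
# Backward proof search for a cut-free propositional sequent calculus.
# Formulas: atoms are strings; compound formulas are tuples like
# ('and', f, g), ('or', f, g), ('imp', f, g), ('not', f).

def provable(gamma, delta):
    """Search for a cut-free proof of the sequent gamma |- delta."""
    atoms_l = {f for f in gamma if isinstance(f, str)}
    atoms_r = {f for f in delta if isinstance(f, str)}
    if atoms_l & atoms_r:                       # axiom: shared atom
        return True
    for i, f in enumerate(gamma):               # try a left rule
        if isinstance(f, str):
            continue
        rest = gamma[:i] + gamma[i+1:]
        if f[0] == 'and':                       # and-L: use both conjuncts
            return provable(rest + [f[1], f[2]], delta)
        if f[0] == 'or':                        # or-L: prove both cases
            return (provable(rest + [f[1]], delta) and
                    provable(rest + [f[2]], delta))
        if f[0] == 'imp':                       # imp-L: prove antecedent, use consequent
            return (provable(rest, delta + [f[1]]) and
                    provable(rest + [f[2]], delta))
        if f[0] == 'not':                       # not-L: move body to the right
            return provable(rest, delta + [f[1]])
    for i, f in enumerate(delta):               # try a right rule
        if isinstance(f, str):
            continue
        rest = delta[:i] + delta[i+1:]
        if f[0] == 'and':                       # and-R: prove both conjuncts
            return (provable(gamma, rest + [f[1]]) and
                    provable(gamma, rest + [f[2]]))
        if f[0] == 'or':                        # or-R: keep both disjuncts
            return provable(gamma, rest + [f[1], f[2]])
        if f[0] == 'imp':                       # imp-R: move antecedent left
            return provable(gamma + [f[1]], rest + [f[2]])
        if f[0] == 'not':                       # not-R: move body to the left
            return provable(gamma + [f[1]], rest)
    return False                                # only atoms remain, no overlap

# Peirce's law ((p -> q) -> p) -> p is classically valid.
peirce = ('imp', ('imp', ('imp', 'p', 'q'), 'p'), 'p')
print(provable([], [peirce]))
```

Every recursive call strictly reduces the number of connectives, so the search terminates; since each premise contains only subformulas of the conclusion, the procedure is also a direct illustration of the subformula property of cut-free proofs.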

The most important rule in the sequent calculus is the cut rule:

$$\frac{\Gamma \vdash \Delta, \varphi \qquad \Gamma', \varphi \vdash \Delta'}{\Gamma, \Gamma' \vdash \Delta, \Delta'}\ \text{(cut)}$$

The cut rule says: if $\varphi$ can be proved, and $\varphi$ can be used to prove something else, then $\varphi$ can be bypassed — the “middle man” is eliminated. Cut corresponds to the use of lemmas in ordinary mathematics: you prove a lemma, then use it. The cut formula $\varphi$ need not appear in the conclusion, so the cut rule can introduce formulas that are not subformulas of the final theorem.

Gentzen’s Hauptsatz (main theorem), also from 1934, states that the cut rule is admissible: any sequent provable with cuts is also provable without cuts. This is the cut-elimination theorem, and it is the single most important result in structural proof theory. The proof is a complex double induction — on the complexity of the cut formula and the depth of the derivation — that systematically pushes cuts upward through the proof tree until they disappear.

The consequences of cut elimination are profound. First, it yields the subformula property: in a cut-free proof of $\Gamma \vdash \Delta$, every formula appearing in the derivation is a subformula of some formula in $\Gamma$ or $\Delta$. This means proofs cannot go “outside” the vocabulary of the theorem — they are analytic. The subformula property immediately implies consistency: there is no cut-free proof of the empty sequent $\vdash$ (with nothing on either side), since there are no subformulas to work with. Gentzen used this idea in 1936 to give the first constructive consistency proof of Peano arithmetic, using transfinite induction up to the ordinal $\varepsilon_0$.

Cut elimination also implies Craig’s interpolation theorem: if $\varphi \vdash \psi$ is provable, then there exists an interpolant $\theta$ — a formula whose non-logical vocabulary appears in both $\varphi$ and $\psi$ — such that $\varphi \vdash \theta$ and $\theta \vdash \psi$ are both provable. The interpolant can be read off from a cut-free proof by examining which formulas mediate between the left and right sides. Interpolation has far-reaching applications in model theory, database theory, and software verification.

Proof Complexity and Automated Reasoning

The study of proof complexity asks a deceptively simple question: how long must a proof be? Different proof systems can produce dramatically different proof lengths for the same theorem, and understanding these separations connects proof theory directly to computational complexity.

The foundational framework was laid by Stephen Cook and Robert Reckhow in 1979. They defined a proof system (in the general sense) as a polynomial-time computable function $f$ whose range is exactly the set of tautologies. A string $\pi$ is a proof of $\varphi$ if $f(\pi) = \varphi$ — that is, proofs are efficiently verifiable certificates of validity. Under this definition, the question “is there a proof system in which every tautology has a polynomial-length proof?” is equivalent to the question of whether $\mathrm{NP} = \mathrm{coNP}$. If $\mathrm{NP} \ne \mathrm{coNP}$, then no single proof system can have short proofs for all tautologies.

The most striking concrete results involve resolution, a proof system based on a single rule: from clauses $(A \lor x)$ and $(B \lor \neg x)$, derive $(A \lor B)$. J. Alan Robinson introduced the resolution principle in 1965 as a foundation for automated theorem proving. Resolution is refutation-complete for propositional logic: a set of clauses is unsatisfiable if and only if the empty clause can be derived. However, resolution proofs can be exponentially long. The classic example is the pigeonhole principle $\mathrm{PHP}_n^{n+1}$, which states that $n+1$ pigeons cannot be placed into $n$ holes without a collision. Armin Haken proved in 1985 that every resolution refutation of $\mathrm{PHP}_n^{n+1}$ requires exponentially many steps in $n$, even though the principle is an elementary combinatorial tautology. This exponential lower bound demonstrates a fundamental limitation of resolution-based reasoning.
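A direct, unoptimized implementation of resolution refutation fits in a few lines. This Python sketch uses DIMACS-style integer literals (negative means negated) and saturates the clause set until the empty clause appears; `refutable` is an illustrative name, not a library call.

```python
# Propositional resolution by saturation: clauses are frozensets of
# signed integer literals; the empty clause witnesses unsatisfiability.

def resolve(c1, c2):
    """All resolvents of clauses c1 and c2 on complementary literals."""
    out = []
    for lit in c1:
        if -lit in c2:
            out.append((c1 - {lit}) | (c2 - {-lit}))
    return out

def refutable(clauses):
    """True iff the empty clause is derivable (the set is unsatisfiable)."""
    known = set(clauses)
    while True:
        new = set()
        for c1 in known:
            for c2 in known:
                for r in resolve(c1, c2):
                    if r == frozenset():
                        return True
                    if r not in known:
                        new.add(r)
        if not new:
            return False          # saturated without the empty clause
        known |= new

# {p}, {-p or q}, {-q} is unsatisfiable; drop {-q} and it is satisfiable.
print(refutable({frozenset({1}), frozenset({-1, 2}), frozenset({-2})}))
print(refutable({frozenset({1}), frozenset({-1, 2})}))
```

Saturation can generate exponentially many clauses, which is exactly what Haken's lower bound says is unavoidable for the pigeonhole family: no cleverer resolution strategy escapes it.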

In practice, resolution is the backbone of modern SAT solvers, which have become remarkably effective despite worst-case exponential bounds. The DPLL algorithm (Davis-Putnam-Logemann-Loveland, 1962) searches for satisfying assignments using unit propagation and backtracking — essentially performing resolution implicitly. Modern conflict-driven clause learning (CDCL) solvers extend DPLL with learned clauses derived from conflicts, enabling them to solve industrial-scale instances with millions of variables. The gap between worst-case exponential bounds and practical polynomial-time behavior on structured instances is one of the central mysteries of computational complexity.
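A toy version of the DPLL loop — unit propagation plus chronological backtracking, without clause learning — can be sketched as follows; the clause encoding matches the resolution example above (sets of signed integers), and `dpll` is illustrative, not a real solver API.

```python
# Minimal DPLL: propagate forced unit literals, then branch on a variable
# and backtrack on conflict. Returns a satisfying set of literals or None.

def unit_propagate(clauses, assignment):
    changed = True
    while changed:
        changed = False
        for c in clauses:
            live = {l for l in c if -l not in assignment}   # not-yet-false literals
            if any(l in assignment for l in live):
                continue                  # clause already satisfied
            if not live:
                return None               # conflict: clause falsified
            if len(live) == 1:
                assignment |= live        # forced unit literal
                changed = True
    return assignment

def dpll(clauses, assignment=frozenset()):
    assignment = unit_propagate(clauses, set(assignment))
    if assignment is None:
        return None
    unassigned = ({abs(l) for c in clauses for l in c} -
                  {abs(l) for l in assignment})
    if not unassigned:
        return assignment                 # every variable decided, no conflict
    v = min(unassigned)                   # branch: try v true, then v false
    return dpll(clauses, assignment | {v}) or dpll(clauses, assignment | {-v})

# (p or q) and (not p or q) and (not q or r): forcing q and r.
model = dpll([{1, 2}, {-1, 2}, {-2, 3}])
print(model is not None and 2 in model and 3 in model)
```

CDCL solvers replace the naive chronological backtracking here with conflict analysis: each conflict yields a learned clause (a resolvent of the clauses involved), so a CDCL run is, implicitly, a resolution proof.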

Beyond propositional reasoning, automated deduction for first-order logic builds on Robinson’s resolution with unification — an algorithm that finds substitutions making two terms identical. First-order resolution, combined with strategies like set-of-support and hyperresolution, powered early automated theorem provers. The modern landscape includes both resolution-based systems and more sophisticated approaches: superposition calculus for equational reasoning, tableaux methods that search for counterexamples, and SMT solvers (satisfiability modulo theories) that combine SAT solving with domain-specific decision procedures for arithmetic, arrays, and other theories.
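Unification itself is a short recursive algorithm. In this Python sketch, variables are strings beginning with `?`, compound terms are tuples of a function symbol and arguments, and the occurs check rejects circular bindings; the representation and the name `unify` are illustrative.

```python
# Robinson-style first-order unification with an occurs check.
# Returns a most general unifier as a dict, or None if none exists.

def is_var(t):
    return isinstance(t, str) and t.startswith('?')

def walk(t, subst):
    """Follow variable bindings to the current representative term."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def occurs(v, t, subst):
    """Does variable v occur inside term t (under subst)?"""
    t = walk(t, subst)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, subst) for a in t[1:])

def unify(t1, t2, subst=None):
    subst = dict(subst or {})
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if is_var(t1):
        return None if occurs(t1, t2, subst) else {**subst, t1: t2}
    if is_var(t2):
        return None if occurs(t2, t1, subst) else {**subst, t2: t1}
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        for a, b in zip(t1[1:], t2[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None                          # symbol clash: no unifier

# unify f(?x, g(?y)) with f(a, g(b)) yields {?x: a, ?y: b};
# unifying ?x with f(?x) fails the occurs check.
print(unify(('f', '?x', ('g', '?y')), ('f', 'a', ('g', 'b'))))
print(unify('?x', ('f', '?x')))
```

First-order resolution applies exactly this step before resolving: two clauses may be resolved on literals whose atoms unify, with the resulting substitution applied to the resolvent.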

The connection between proof theory and computation reaches its fullest expression in interactive theorem provers such as Coq, Lean, and Isabelle. These systems implement formal proof calculi — typically based on dependent type theory or higher-order logic — and allow mathematicians to construct machine-verified proofs. The Curry-Howard correspondence ensures that every proof in these systems is also a program, and every program is a proof. Major mathematical results have been formalized in these systems, including the four-color theorem (in Coq, 2005) and the Kepler conjecture (in Isabelle and HOL Light, 2014). As the technology matures, the boundary between proof theory and software engineering continues to dissolve, and the question of what constitutes a rigorous proof is being reshaped by the tools that Gentzen’s work made possible.