Optimization
Linear, convex, and nonlinear programming — finding the best among many.
Optimization is the mathematical discipline of finding the best element — the maximum or minimum of a function — from a set of available choices. It is one of the oldest and most practically consequential areas of applied mathematics, underpinning everything from the design of aircraft wings to the training of neural networks. At its core, optimization asks a simple question: given a measure of quality and a set of allowable configurations, which configuration is the best?
Fundamentals of Optimization
An optimization problem in its most general form consists of three ingredients: a set of decision variables $x \in \mathbb{R}^n$, an objective function $f : \mathbb{R}^n \to \mathbb{R}$ to be minimized or maximized, and a feasible set $\Omega \subseteq \mathbb{R}^n$ of allowed values. The problem is written as

$$\min_{x \in \Omega} f(x),$$

where the feasible set is typically described by equality constraints $h_i(x) = 0$ and inequality constraints $g_j(x) \le 0$. A point $x^*$ is a global minimum if $f(x^*) \le f(x)$ for all $x \in \Omega$, and a local minimum if there exists a neighborhood $N$ of $x^*$ such that $f(x^*) \le f(x)$ for all $x \in N \cap \Omega$.
The distinction between local and global optima is not merely pedantic — it is the central difficulty of optimization. Most algorithms can only reliably find local optima, and in general the problem of locating a global optimum is computationally intractable. The exception arises when the problem has special structure. The most important such structure is convexity: a function $f$ is convex if for all $x, y$ and all $\lambda \in [0, 1]$,

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).$$

A set is convex if it contains all line segments between its points. When $f$ is convex and $\Omega$ is a convex set, every local minimum is automatically a global minimum — the landscape has no traps, and efficient algorithms can find the optimum reliably.
Optimality conditions tell us what must hold at a minimum. For an unconstrained problem with a differentiable objective, the first-order necessary condition is that the gradient vanishes: $\nabla f(x^*) = 0$. The second-order necessary condition requires that the Hessian $\nabla^2 f(x^*)$ be positive semidefinite. Conversely, if $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, then $x^*$ is a strict local minimum. These conditions are the multivariable analogs of the familiar single-variable tests from calculus, and they were systematically developed throughout the 19th century by Carl Friedrich Gauss, Adrien-Marie Legendre, and others working on least-squares problems.
When the feasible set is defined by constraints, the conditions become more subtle. The key tool is the Karush-Kuhn-Tucker (KKT) framework: at a local minimum $x^*$ of $f$ subject to $g_j(x) \le 0$ and $h_i(x) = 0$, under suitable constraint qualifications, there exist multipliers $\mu_j \ge 0$ and $\lambda_i$ such that

$$\nabla f(x^*) + \sum_j \mu_j \nabla g_j(x^*) + \sum_i \lambda_i \nabla h_i(x^*) = 0,$$

along with complementary slackness: $\mu_j \, g_j(x^*) = 0$ for each $j$. These conditions were derived independently by William Karush in his 1939 master's thesis and by Harold Kuhn and Albert Tucker in 1951, and they form the cornerstone of constrained optimization theory.
Linear Programming
The simplest structured optimization problem is linear programming (LP): minimize a linear objective over a polyhedron defined by linear inequalities. The standard form is

$$\min_x \; c^\top x \quad \text{subject to} \quad Ax = b, \; x \ge 0,$$

where $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, and $b \in \mathbb{R}^m$. The feasible set — the polyhedron or polytope — is a convex region with flat faces. A fundamental geometric fact, the fundamental theorem of linear programming, asserts that if the optimal value is attained, it is attained at a vertex (also called a basic feasible solution) of the polytope. This reduces the infinitely many feasible points to finitely many candidates.
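For a tiny problem, the fundamental theorem can be exploited directly: enumerate every intersection of constraint boundaries, keep the feasible ones (the vertices), and pick the best. A minimal sketch for a 2-variable LP in inequality form (the polygon and cost vector below are illustrative, not from the text):

```python
from itertools import combinations

# Minimize c.x over the polygon {x : A x <= b} by enumerating its vertices.
# Each vertex is the intersection of two constraint boundaries.
c = (-1.0, -2.0)                        # objective: min -x1 - 2*x2
A = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (1.0, 1.0)]
b = [4.0, 4.0, 0.0, 0.0, 6.0]           # box [0,4]^2 cut by x1 + x2 <= 6

def solve2x2(a1, a2, b1, b2):
    """Solve the 2x2 system a1.x = b1, a2.x = b2 (None if singular)."""
    det = a1[0] * a2[1] - a1[1] * a2[0]
    if abs(det) < 1e-12:
        return None
    return ((b1 * a2[1] - b2 * a1[1]) / det,
            (a1[0] * b2 - a2[0] * b1) / det)

def feasible(x, tol=1e-9):
    return all(ai[0] * x[0] + ai[1] * x[1] <= bi + tol
               for ai, bi in zip(A, b))

vertices = []
for i, j in combinations(range(len(A)), 2):
    x = solve2x2(A[i], A[j], b[i], b[j])
    if x is not None and feasible(x):
        vertices.append(x)

# By the fundamental theorem, the optimum is the best vertex.
best = min(vertices, key=lambda x: c[0] * x[0] + c[1] * x[1])
```

This brute force is exponential in the number of constraints, which is exactly why the simplex method's guided walk between vertices was such a breakthrough.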
This geometric insight drove George Dantzig’s invention of the simplex method in 1947, one of the most influential computational discoveries of the 20th century. The simplex method moves from vertex to adjacent vertex of the polytope, always improving the objective, until no further improvement is possible. Each step — called a pivot — identifies an entering variable and a leaving variable, updating the basis that represents the current vertex. In practice, the simplex method runs very quickly, typically visiting only a small fraction of the exponentially many vertices. Yet in the worst case, as Victor Klee and George Minty showed in 1972, it can visit all vertices — so the simplex method is not polynomial in the worst case.
The first provably polynomial algorithm for linear programming was Leonid Khachiyan’s ellipsoid method (1979), which operates by shrinking ellipsoids around the feasible set rather than traversing vertices. More practically significant were the interior point methods, particularly Narendra Karmarkar’s algorithm (1984), which moves through the interior of the polytope along a central path. Modern predictor-corrector interior point methods, developed by Mehrotra in the 1990s, achieve polynomial complexity and often outperform simplex on large instances.
Duality is perhaps the deepest structural feature of linear programming. Every LP has an associated dual problem: if the primal is $\min c^\top x$ subject to $Ax = b$, $x \ge 0$, then the dual is $\max b^\top y$ subject to $A^\top y \le c$. Weak duality says the dual objective always lower-bounds the primal objective for feasible points. Strong duality — proved by Dantzig and independently by von Neumann — says the two optimal values are equal when both problems are feasible: $c^\top x^* = b^\top y^*$. Duality has profound consequences for sensitivity analysis, economic interpretation (the dual variables are shadow prices expressing the marginal value of each constraint), and algorithm design. The dual simplex method, complementary slackness conditions, and parametric analysis all flow from this duality structure.
Convex Optimization and Duality
Convex optimization extends the clean theory of linear programming to problems with nonlinear but convex objectives and constraints. Because every local optimum is global and strong duality holds under mild conditions, convex optimization problems are often tractable even when they are not linear.
A convex program takes the form $\min f(x)$ subject to $g_j(x) \le 0$ and $Ax = b$, where $f$ and all $g_j$ are convex. The KKT conditions are both necessary and sufficient for global optimality when the functions are convex — a remarkable fact that makes the KKT system a complete characterization of the solution. The key regularity assumption ensuring the KKT conditions hold is Slater's condition: the existence of a strictly feasible point satisfying all inequality constraints with strict inequality.
Lagrangian duality generalizes LP duality to convex programs. The Lagrangian is

$$L(x, \lambda, \mu) = f(x) + \sum_j \mu_j g_j(x) + \lambda^\top (Ax - b),$$

and the dual function $q(\lambda, \mu) = \inf_x L(x, \lambda, \mu)$ is always concave (as an infimum of affine functions of $(\lambda, \mu)$). The dual problem is $\max_{\lambda, \, \mu \ge 0} q(\lambda, \mu)$. The duality gap $f(x^*) - q(\lambda^*, \mu^*)$ is zero under Slater's condition — this is strong duality for convex programs, a result proved in full generality by Werner Fenchel and later extended by R. Tyrrell Rockafellar in his foundational 1970 treatise Convex Analysis.
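Strong duality can be checked by hand on a one-dimensional toy program (chosen here for illustration, not taken from the text): minimize $(x-2)^2$ subject to $x \le 1$. Slater's condition holds (e.g. $x = 0$ is strictly feasible), so the primal and dual optimal values must coincide:

```python
# Toy convex program: minimize (x - 2)^2 subject to x - 1 <= 0.
# Slater's condition holds, so the duality gap is zero.

def lagrangian(x, mu):
    return (x - 2.0) ** 2 + mu * (x - 1.0)

def dual(mu):
    # q(mu) = inf_x L(x, mu); the infimum is at x = 2 - mu/2 (set dL/dx = 0).
    x = 2.0 - mu / 2.0
    return lagrangian(x, mu)

# Primal optimum: the unconstrained minimizer x = 2 is infeasible, so x* = 1.
primal_opt = (1.0 - 2.0) ** 2                     # value 1

# Maximize the concave dual q(mu) = mu - mu^2/4 over mu >= 0 by a coarse scan.
dual_opt = max(dual(mu / 100.0) for mu in range(0, 501))
```

The dual maximum is attained at $\mu^* = 2$, which is exactly the KKT multiplier of the active constraint, and both optimal values equal 1.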
Important special classes of convex programs have exploitable structure. Quadratic programs (QPs) minimize a convex quadratic objective over a polyhedral set and arise ubiquitously in least-squares problems, support vector machines, and model predictive control. Second-order cone programs (SOCPs) generalize QPs by allowing second-order cone constraints of the form $\|Ax + b\|_2 \le c^\top x + d$; they capture portfolio optimization and robust linear programs. Semidefinite programs (SDPs) optimize over the cone of positive semidefinite matrices and have become a central tool in combinatorial optimization (through relaxations), control theory, and polynomial optimization. All of these can be solved in polynomial time by interior point methods.
The subdifferential extends differentiation to convex but non-smooth functions. A vector $g$ is a subgradient of $f$ at $x$ if $f(y) \ge f(x) + g^\top (y - x)$ for all $y$. The set of all subgradients at $x$ is the subdifferential $\partial f(x)$. Optimality for a convex program becomes $0 \in \partial f(x^*)$, and algorithms like the proximal gradient method and ADMM (alternating direction method of multipliers) exploit this structure to handle non-smooth terms like the $\ell_1$ norm that appear in compressed sensing and sparse recovery.
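The proximal machinery is concrete even in one dimension. For the scalar lasso-type problem $\min_x \tfrac{1}{2}(x - a)^2 + \lambda |x|$, the proximal operator of $\lambda|\cdot|$ is soft-thresholding, and proximal gradient iteration (ISTA) converges to the closed-form solution. A minimal sketch with illustrative constants:

```python
# Proximal gradient (ISTA) sketch for the scalar problem
#   minimize 0.5 * (x - a)^2 + lam * |x|
# The prox of lam*|.| is soft-thresholding.

def soft_threshold(z, t):
    """prox of t*|.|: shrink z toward zero by t, clipping at zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista(a, lam, step=1.0, iters=100):
    x = 0.0
    for _ in range(iters):
        grad = x - a                                  # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)
    return x

# The closed-form minimizer is soft_threshold(a, lam); ISTA reaches it.
x_star = ista(a=3.0, lam=1.0)
```

The same prox-then-gradient pattern, applied coordinate-wise, is the core of solvers for high-dimensional sparse regression.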
Unconstrained and Constrained Methods
Practical optimization requires algorithms that converge reliably and quickly. The landscape of numerical methods divides broadly into first-order methods (using only gradients) and second-order methods (using also Hessians or their approximations).
Gradient descent is the foundational first-order method: $x_{k+1} = x_k - \alpha \nabla f(x_k)$, where $\alpha > 0$ is a step size or learning rate. For a convex function with Lipschitz gradient, gradient descent converges at rate $O(1/k)$ — meaning $O(1/\varepsilon)$ iterations suffice to reach $\varepsilon$-accuracy. Nesterov's accelerated gradient descent (1983) achieves the optimal rate $O(1/k^2)$ for smooth convex functions by adding a momentum term: the next iterate is built not from the current point but from an extrapolation beyond it. This acceleration, sometimes called the fast gradient method, cannot be improved upon by any first-order algorithm — a lower bound also due to Nesterov.
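Both updates fit in a few lines. A sketch on an illustrative ill-conditioned quadratic $f(x_1, x_2) = \tfrac{1}{2}(x_1^2 + 25 x_2^2)$, with step size $1/L$ where $L = 25$ is the gradient's Lipschitz constant, and the common $k/(k+3)$ momentum schedule for the accelerated variant:

```python
# Gradient descent vs. Nesterov's accelerated variant on
#   f(x1, x2) = 0.5 * (x1^2 + 25 * x2^2)   (illustrative test function).

L = 25.0                                      # Lipschitz constant of the gradient

def grad(x):
    return (x[0], 25.0 * x[1])

def gd(x0, iters):
    x = x0
    for _ in range(iters):
        g = grad(x)
        x = (x[0] - g[0] / L, x[1] - g[1] / L)
    return x

def nesterov(x0, iters):
    x, y = x0, x0                             # y is the extrapolated point
    for k in range(iters):
        g = grad(y)
        x_new = (y[0] - g[0] / L, y[1] - g[1] / L)
        beta = k / (k + 3.0)                  # momentum coefficient
        y = (x_new[0] + beta * (x_new[0] - x[0]),
             x_new[1] + beta * (x_new[1] - x[1]))
        x = x_new
    return x

x_gd = gd((1.0, 1.0), 200)
x_agd = nesterov((1.0, 1.0), 200)
```

The key structural difference is that the gradient is evaluated at the extrapolated point `y`, not at the current iterate — that single change is what earns the $O(1/k^2)$ rate.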
Newton’s method uses second-order information: $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$. Near a strict local minimum, Newton’s method converges quadratically — the error roughly squares at each step. The drawback is that computing and inverting the Hessian costs $O(n^3)$ per iteration, which is prohibitive for large problems. Quasi-Newton methods maintain a low-rank approximation to the Hessian inverse that is updated using gradient information. The most successful quasi-Newton algorithm, BFGS (named for Broyden, Fletcher, Goldfarb, and Shanno, 1970), achieves superlinear convergence and is the default workhorse for medium-scale smooth optimization. L-BFGS stores only the most recent correction pairs rather than the full matrix, making it practical for high-dimensional problems.
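Quadratic convergence is easy to observe numerically. For the illustrative one-dimensional objective $f(x) = x - \ln x$ (strict minimum at $x = 1$, with $f'(x) = 1 - 1/x$ and $f''(x) = 1/x^2$), the Newton step has a closed form and the error squares at every iteration:

```python
# Newton's method on f(x) = x - ln(x), minimized at x = 1 (illustrative).
# f'(x) = 1 - 1/x, f''(x) = 1/x^2.

def newton_min(x0, iters):
    x = x0
    errors = []
    for _ in range(iters):
        g = 1.0 - 1.0 / x          # gradient
        h = 1.0 / (x * x)          # Hessian (a scalar here)
        x = x - g / h              # Newton step: x - f'(x)/f''(x)
        errors.append(abs(x - 1.0))
    return x, errors

x_star, errs = newton_min(1.5, 6)
# Quadratic convergence: each error is (up to rounding) the square of the last.
```

Starting from $x_0 = 1.5$ the errors run $0.25,\ 0.0625,\ \approx\!0.0039,\ \approx\!1.5\times 10^{-5},\ \dots$ — the digits of accuracy roughly double per step, which is the signature of Newton's method near a well-behaved minimum.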
For constrained nonlinear problems, sequential quadratic programming (SQP) solves a sequence of quadratic subproblems, each approximating the original problem near the current iterate. At each step, the objective is replaced by a quadratic model and the constraints are linearized, yielding a QP whose solution provides the next step. SQP methods achieve fast local convergence and are the basis of many industrial optimization solvers. Interior point methods for nonlinear programming replace inequality constraints with a log-barrier term $-\mu \sum_j \log(-g_j(x))$, converting the constrained problem into an unconstrained one parameterized by a barrier parameter $\mu > 0$. Newton’s method applied to the barrier problem traces out a central path through the feasible interior, converging to the optimum as $\mu \to 0$.
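The barrier idea can be watched in one dimension. For the illustrative problem of minimizing $x$ subject to $x \ge 1$, the barrier function $B(x; \mu) = x - \mu \log(x - 1)$ has its minimum at the central-path point $x(\mu) = 1 + \mu$; driving $\mu$ down walks that path toward the constrained optimum $x^* = 1$. A sketch using damped Newton steps that stay strictly feasible:

```python
# Interior point sketch: minimize x subject to x >= 1 via the log-barrier
#   B(x; mu) = x - mu * log(x - 1),
# whose minimizer is the central-path point x(mu) = 1 + mu (toy problem).

def newton_on_barrier(x, mu, iters=50):
    for _ in range(iters):
        g = 1.0 - mu / (x - 1.0)               # B'(x)
        h = mu / (x - 1.0) ** 2                # B''(x) > 0 on the interior
        step = g / h
        # Damp the step so the iterate stays strictly inside x > 1.
        while x - step <= 1.0:
            step *= 0.5
        x -= step
    return x

x = 2.0                                        # strictly feasible start
path = []
for mu in [1.0, 0.1, 0.01, 0.001]:
    x = newton_on_barrier(x, mu)               # warm-start from the last point
    path.append(x)
# path tracks x(mu) = 1 + mu, approaching the optimum x* = 1 as mu -> 0.
```

Each barrier solve is warm-started from the previous one — the same path-following structure used by production interior point codes, minus all the engineering.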
Integer and Combinatorial Optimization
When decision variables are required to take integer values — for instance, because they represent yes/no decisions or indivisible quantities — the problem becomes dramatically harder. Integer linear programs (ILPs) combine linear objectives and constraints with integrality requirements, and their feasible sets are no longer convex. The LP relaxation — obtained by dropping the integrality requirement — provides a lower bound on the optimal integer value, but the relaxation solution may be fractional and far from any integer feasible point.
The dominant exact algorithm for ILPs is branch and bound, introduced by Ailsa Land and Alison Doig in 1960. The method partitions the feasible set by branching on a fractional variable, creating subproblems whose LP relaxations are solved to obtain bounds. Subproblems whose bounds exceed the best known integer solution are pruned; the tree is explored until all branches are resolved. Cutting planes — additional valid linear inequalities that cut off fractional solutions without removing integer feasible points — are used to tighten relaxations and reduce the tree. Gomory cuts, derived from the simplex tableau by Ralph Gomory in 1958, were the first systematic cutting plane family. Modern branch-and-cut solvers combine cutting planes with branching in a sophisticated way, and they can routinely solve ILPs with hundreds of thousands of variables.
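The interplay of branching, bounding, and pruning fits in a page for a small 0/1 knapsack (itself an ILP). In this sketch the bound on each subproblem is its LP relaxation — the fractional knapsack, solved greedily over items pre-sorted by value/weight ratio; the data are illustrative:

```python
# Branch-and-bound for a small 0/1 knapsack, using the LP relaxation
# (fractional knapsack) as the upper bound. Items are sorted by value/weight.

values = [10, 13, 14, 8]
weights = [3, 4, 5, 3]
capacity = 8

def lp_bound(i, remaining, value):
    """LP-relaxation bound: fill remaining capacity greedily, allowing fractions."""
    bound = value
    for j in range(i, len(values)):
        if weights[j] <= remaining:
            remaining -= weights[j]
            bound += values[j]
        else:
            bound += values[j] * remaining / weights[j]  # fractional item
            break
    return bound

best = 0

def branch(i, remaining, value):
    global best
    if value > best:
        best = value                               # new incumbent
    if i == len(values) or lp_bound(i, remaining, value) <= best:
        return                                     # prune: bound cannot beat best
    if weights[i] <= remaining:                    # branch: take item i ...
        branch(i + 1, remaining - weights[i], value + values[i])
    branch(i + 1, remaining, value)                # ... or leave it

branch(0, capacity, 0)
```

For this instance the optimum is 24 (items of value 10 and 14). Production branch-and-cut solvers follow the same skeleton, but add cutting planes, presolve, and far more careful branching rules.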
Combinatorial optimization concerns problems defined over discrete structures — graphs, sets, sequences — where the feasible set is finite but astronomically large. Classic examples include the traveling salesman problem (TSP) — find the shortest Hamiltonian cycle through cities — the knapsack problem — select items to maximize value within a weight budget — and the maximum cut problem — partition vertices of a graph to maximize edges between parts. Many combinatorial problems are NP-hard, meaning no polynomial-time algorithm is known, and none exists unless P = NP. For such problems, two approaches prevail: approximation algorithms that find solutions with provable quality guarantees, and metaheuristics — algorithms like simulated annealing, genetic algorithms, and tabu search — that trade provable guarantees for practical performance on real instances.
Simulated annealing, introduced by Kirkpatrick, Gelatt, and Vecchi in 1983, mimics the physical process of slow cooling: at each step a random neighbor is sampled, and worse solutions are accepted with probability $\exp(-\Delta E / T)$, where $\Delta E$ is the increase in objective value and $T$ is a temperature parameter that decreases over time. This controlled randomness allows the method to escape local optima. The connection to statistical physics is not merely metaphorical — the algorithm converges to the global optimum if the temperature is decreased slowly enough (following a logarithmic cooling schedule), a guarantee proved using the theory of Markov chains.
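The accept-worse-with-probability rule is the whole algorithm. A sketch minimizing an illustrative tilted double-well $f(x) = (x^2 - 1)^2 + 0.3x$, which traps plain descent in the shallower right well, using the practical geometric cooling schedule rather than the logarithmic one needed for the theoretical guarantee:

```python
import math
import random

# Simulated annealing on f(x) = (x^2 - 1)^2 + 0.3 x (illustrative objective;
# global minimum near x = -1.04, a shallower local minimum near x = 0.96).

def f(x):
    return (x * x - 1.0) ** 2 + 0.3 * x

def anneal(x0, iters=20000, t0=2.0, cooling=0.9995, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, f_best = x, fx
    T = t0
    for _ in range(iters):
        y = x + rng.gauss(0.0, 0.3)            # sample a random neighbor
        fy = f(y)
        delta = fy - fx
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            x, fx = y, fy                      # accept (always if not worse)
            if fx < f_best:
                best, f_best = x, fx
        T *= cooling                           # geometric cooling schedule
    return best, f_best

x_best, f_best = anneal(2.0)
```

Early on, high temperature lets the chain hop between wells; as $T$ shrinks, the walk freezes into whichever basin it occupies — which is why the cooling rate, not the move rule, is the delicate design choice.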
Dynamic Programming and Optimal Control
Dynamic programming (DP) is a strategy for solving optimization problems that have optimal substructure: an optimal solution to the whole problem can be built from optimal solutions to subproblems. The term and the systematic theory were developed by Richard Bellman in the 1950s. Bellman articulated the principle of optimality: regardless of the initial state and decision, the remaining decisions must constitute an optimal policy for the state resulting from the first decision. This principle leads to the Bellman equation, which expresses the optimal value function recursively.
For a discrete-time optimal control problem — choose a sequence of controls $u_0, \dots, u_{N-1}$ to minimize $\sum_{t=0}^{N-1} c(x_t, u_t) + c_N(x_N)$ subject to dynamics $x_{t+1} = f(x_t, u_t)$ — the Bellman equation is

$$V_t(x) = \min_u \left[\, c(x, u) + V_{t+1}(f(x, u)) \,\right],$$

with boundary condition $V_N(x) = c_N(x)$. This is solved backwards in time: compute $V_{N-1}$, then $V_{N-2}$, and so on. The optimal control at each time is the minimizer on the right-hand side. Dynamic programming is exact but suffers from the curse of dimensionality — the state space grows exponentially in dimension, making exact DP computationally feasible only for low-dimensional state spaces.
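The backward recursion is mechanical once the state space is small and discrete. A sketch on an illustrative toy problem — states on a grid, three controls, quadratic stage and terminal costs — with a brute-force enumeration of all control sequences as a correctness cross-check:

```python
from itertools import product

# Backward Bellman recursion for a tiny deterministic control problem
# (illustrative data): states x in {-3,...,3}, controls u in {-1,0,1},
# dynamics x' = x + u (clipped to the grid), stage cost x^2 + u^2,
# terminal cost x^2, horizon N = 3.

STATES = range(-3, 4)
CONTROLS = (-1, 0, 1)
N = 3

def step(x, u):
    return max(-3, min(3, x + u))

def cost(x, u):
    return x * x + u * u

# V[t][x], computed backward from the terminal condition V_N(x) = x^2.
V = [dict() for _ in range(N + 1)]
V[N] = {x: x * x for x in STATES}
for t in range(N - 1, -1, -1):
    for x in STATES:
        V[t][x] = min(cost(x, u) + V[t + 1][step(x, u)] for u in CONTROLS)

def brute(x0):
    """Enumerate all 3^N control sequences; must agree with V[0][x0]."""
    best = float("inf")
    for us in product(CONTROLS, repeat=N):
        x, total = x0, 0
        for u in us:
            total += cost(x, u)
            x = step(x, u)
        best = min(best, total + x * x)
    return best
```

The recursion touches each (state, time, control) triple once — here $7 \times 3 \times 3$ evaluations versus $3^3$ full sequences per start state for brute force; it is exactly this gap that the curse of dimensionality erases when the state grid grows with dimension.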
For continuous-time problems, the analogous object is the Hamilton-Jacobi-Bellman (HJB) equation, a nonlinear PDE satisfied by the value function:

$$-\frac{\partial V}{\partial t}(x, t) = \min_u \left[\, c(x, u) + \nabla_x V(x, t)^\top f(x, u) \,\right].$$
An alternative to the HJB approach is Pontryagin’s maximum principle, proved by Lev Pontryagin and collaborators in 1956. Instead of a PDE for the value function, the maximum principle gives necessary conditions in the form of a two-point boundary value problem: introduce a costate variable $p(t)$ satisfying adjoint dynamics, and the optimal control satisfies a pointwise minimization condition (minimizing the Hamiltonian) at each time $t$. For the important special case of linear quadratic (LQ) control — linear dynamics and quadratic cost — the optimal feedback law is linear, and the gain matrix solves the Riccati differential equation. LQ control and its infinite-horizon version (leading to the algebraic Riccati equation) are the bedrock of modern control engineering.
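In discrete time the same Riccati structure appears as a simple backward recursion, and iterating it to a fixed point solves the algebraic Riccati equation. A scalar sketch with illustrative system and cost parameters (the continuous-time version replaces this recursion with a Riccati ODE):

```python
# Scalar discrete-time LQR sketch (illustrative parameters):
# dynamics x' = a*x + b*u, stage cost q*x^2 + r*u^2.

a, b, q, r = 1.1, 1.0, 1.0, 1.0        # note: the open-loop system (a > 1) is unstable

def riccati_step(P):
    """One backward step of the discrete Riccati recursion."""
    return q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)

# Iterating to a fixed point yields the algebraic Riccati equation's solution,
# which defines the optimal infinite-horizon linear feedback u = -K x.
P = 1.0
for _ in range(1000):
    P = riccati_step(P)
K = a * b * P / (r + b * b * P)
```

At the fixed point, the closed-loop coefficient $a - bK$ has magnitude below one: the optimal linear feedback stabilizes the otherwise unstable system, which is the practical payoff of LQ theory.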
Stochastic and Machine Learning Optimization
When the objective function is not known analytically but only through noisy evaluations — as when $f(x) = \mathbb{E}_\xi[F(x, \xi)]$ for a random variable $\xi$ — we enter the domain of stochastic optimization. The foundational algorithm is stochastic gradient descent (SGD), introduced by Herbert Robbins and Sutton Monro in 1951 in the context of stochastic approximation. At each step, instead of computing the full gradient $\nabla f(x_k)$, one draws a single sample $\xi_k$ and updates

$$x_{k+1} = x_k - \alpha_k \nabla F(x_k, \xi_k).$$

The step sizes must satisfy the Robbins-Monro conditions $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$ to ensure convergence. SGD is computationally cheap per iteration — $O(1)$ gradient evaluations rather than $O(N)$ for a dataset of $N$ examples — and this makes it the algorithm of choice for training machine learning models on large datasets, where the full gradient would require a pass over millions of examples.
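The cleanest instance is estimating a mean: for $f(x) = \mathbb{E}[\tfrac{1}{2}(x - \xi)^2]$ the stochastic gradient at sample $\xi_k$ is $x - \xi_k$, and the Robbins-Monro-compliant schedule $\alpha_k = 1/(k+1)$ makes the SGD iterate exactly the running sample mean. A sketch with illustrative synthetic data:

```python
import random

# SGD / Robbins-Monro sketch: minimize f(x) = E[0.5 * (x - xi)^2],
# whose minimizer is the mean of xi. Stochastic gradient: (x - xi_k).

def sgd(samples):
    x = 0.0
    for k, xi in enumerate(samples):
        alpha = 1.0 / (k + 1)      # sum(alpha) diverges, sum(alpha^2) converges
        x -= alpha * (x - xi)      # stochastic gradient step
    return x

rng = random.Random(1)
data = [rng.gauss(5.0, 2.0) for _ in range(10000)]
x_hat = sgd(data)
# With this schedule the iterate equals the running sample mean,
# so x_hat converges to the true mean (5 here) as the sample grows.
```

The algebra behind the comment: $x_{k+1} = \frac{k}{k+1} x_k + \frac{1}{k+1}\xi_k$, which is precisely the recursive update of an average — a useful sanity check that the step-size conditions are doing real work.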
Variance reduction methods improve on plain SGD by periodically computing a full gradient to use as a control variate. SVRG (stochastic variance reduced gradient), proposed by Johnson and Zhang in 2013, achieves linear convergence for strongly convex objectives — the same asymptotic rate as full gradient descent but at a per-iteration cost comparable to SGD. Adam (adaptive moment estimation), introduced by Kingma and Ba in 2015, maintains per-coordinate running estimates of the first and second moments of gradients, adapting the step size automatically. Adam and its variants dominate modern deep learning practice, though their theoretical convergence properties on nonconvex objectives remain an active research area.
The optimization landscape of deep neural networks is nonconvex, high-dimensional, and not analytically tractable. Yet in practice, SGD and its adaptive variants reliably find solutions that generalize well to unseen data. Understanding why this is so — why local minima found by SGD are good, what role implicit regularization plays, and how overparameterization affects the loss landscape — is one of the central open questions at the intersection of optimization and machine learning. The neural tangent kernel framework of Jacot, Gabriel, and Hongler (2018) showed that infinitely wide neural networks trained by gradient descent behave like kernel machines, providing one path toward theoretical understanding. For finite networks, landscape analysis exploiting symmetry and overparameterization has explained the absence of bad local minima in certain architectures, but a complete theory remains elusive.
Multi-objective optimization addresses problems with several competing objectives, where no single solution is best across all criteria. The solution concept is Pareto optimality: a point $x^*$ is Pareto optimal if there is no feasible $x$ that improves at least one objective without worsening any other. The set of all Pareto optimal points forms the Pareto front, a curve or surface trading off objectives against one another. Applications range from engineering design (balancing weight and strength) to economics (allocating resources among agents) to machine learning (balancing accuracy and model size). Scalarization methods convert multi-objective problems to single-objective ones by forming a weighted sum of objectives, tracing out the Pareto front by varying weights. Evolutionary algorithms such as NSGA-II maintain a population of diverse solutions and are widely used when the Pareto front has complex geometry.
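Weighted-sum scalarization is easy to demonstrate on an illustrative two-objective problem with $f_1(x) = x^2$ and $f_2(x) = (x-2)^2$: each weight in $(0,1)$ yields one Pareto-optimal point, and sweeping the weight traces the front. A sketch using a crude grid search as the single-objective solver:

```python
# Weighted-sum scalarization for a toy bi-objective problem:
#   f1(x) = x^2,  f2(x) = (x - 2)^2.
# Each weight w in (0,1) picks out one Pareto-optimal point.

def f1(x):
    return x ** 2

def f2(x):
    return (x - 2.0) ** 2

def scalarized_min(w, lo=-5.0, hi=5.0, steps=20000):
    """Minimize w*f1 + (1-w)*f2 by a fine grid search (toy solver)."""
    best_x, best_v = lo, float("inf")
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        v = w * f1(x) + (1.0 - w) * f2(x)
        if v < best_v:
            best_x, best_v = x, v
    return best_x

# Sweep w = 0.1, 0.2, ..., 0.9 and record the objective pairs on the front.
front = [(f1(x), f2(x))
         for x in (scalarized_min(w / 10.0) for w in range(1, 10))]
```

Along the resulting front, every step that improves $f_1$ strictly worsens $f_2$ and vice versa — the defining trade-off of Pareto optimality. (For convex problems, weighted sums reach the whole front; for nonconvex fronts they miss the concave portions, which is one motivation for population methods like NSGA-II.)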
The remarkable breadth of optimization — from the clean geometry of linear programs to the noisy landscapes of deep learning training — reflects its status as the engine of modern applied mathematics. Wherever a decision must be made, a system must be designed, or a model must be fit, optimization is at work. Its theorems provide the guarantees; its algorithms provide the tools; and its ongoing synthesis with probability, statistics, and computation continues to define the frontier of what can be computed and what can be understood.