Computer Vision
Teaching machines to interpret visual information — image processing, object detection, scene understanding, and deep learning for vision.
Computer vision is the science of enabling machines to extract meaningful information from images and video — to see, in the broadest sense, the way biological organisms do. Where computer graphics synthesizes images from mathematical models, computer vision inverts that process, recovering structure, identity, motion, and semantics from raw pixel data. The field draws on optics, linear algebra, probability, and increasingly on deep learning, and its applications span autonomous driving, medical diagnosis, surveillance, robotics, and the generative models reshaping creative industries.
Image Formation and Camera Geometry
Every vision system begins with the physics of image formation — how a three-dimensional scene is projected onto a two-dimensional sensor. The foundational model is the pinhole camera, which projects a point $(X, Y, Z)$ in world coordinates to a point $(x, y)$ on the image plane according to the relations $x = fX/Z$ and $y = fY/Z$, where $f$ is the focal length. In homogeneous coordinates, this projection is expressed compactly as a $3 \times 4$ projection matrix $P$ such that $\tilde{x} = P\tilde{X}$, where tildes denote homogeneous representations. The projection matrix decomposes as $P = K[R \mid t]$, separating the intrinsic parameters $K$ (focal length, principal point, pixel aspect ratio) from the extrinsic parameters $R, t$ (the camera’s rotation and translation relative to the world). Real lenses introduce distortion — radial distortion causes straight lines to appear curved, and tangential distortion arises from imperfect lens alignment — modeled by polynomial correction terms applied to the normalized image coordinates.
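As a minimal sketch of this decomposition (assuming NumPy; the function name and the toy intrinsics are illustrative, not from any particular library), projecting world points through K [R | t] takes only a few lines:

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D world points X (N x 3) into pixel coordinates
    using the pinhole model x ~ K [R | t] X."""
    X = np.asarray(X, dtype=float)
    Xc = X @ R.T + t            # world -> camera coordinates
    x = Xc @ K.T                # apply intrinsics
    return x[:, :2] / x[:, 2:]  # perspective divide

# Identity pose; focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
# A point on the optical axis lands on the principal point.
pts = project(K, R, t, [[0.0, 0.0, 2.0], [0.1, 0.0, 2.0]])
```

Note how the perspective divide by the third homogeneous coordinate is what makes distant points shrink toward the principal point.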
Camera calibration determines these parameters from known correspondences between world points and image points. The dominant technique, introduced by Zhengyou Zhang in 2000, uses multiple images of a planar checkerboard pattern. Each image provides a homography relating the pattern plane to the image, and from three or more homographies the intrinsic and extrinsic parameters can be recovered by solving a system of linear constraints followed by nonlinear refinement. Zhang’s method is both flexible and accurate, requiring no specialized equipment, and it has been implemented in virtually every vision library. Beyond classical cameras, modern depth-sensing technologies — structured light (projecting known patterns and analyzing their deformation), time-of-flight sensors (measuring the round-trip time of infrared pulses), and LiDAR (scanning laser rangefinding) — directly measure the third dimension, providing dense or sparse depth maps that complement or replace stereo reconstruction.
The relationship between world geometry and image geometry becomes especially rich when two cameras observe the same scene. Epipolar geometry describes the constraints: a point in one image constrains the corresponding point in the other to lie on a line, the epipolar line. This constraint is captured by the fundamental matrix $F$, a rank-2 $3 \times 3$ matrix satisfying $\tilde{x}'^{\top} F \tilde{x} = 0$ for corresponding points. When the cameras are calibrated, the fundamental matrix simplifies to the essential matrix $E = K'^{\top} F K = [t]_{\times} R$, which encodes only the relative rotation and translation and can be decomposed into $R$ and $t$ (up to scale). These results, formalized by Olivier Faugeras and others in the 1990s, are the mathematical foundation of stereo vision, structure from motion, and visual SLAM.
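A hedged sketch of the epipolar constraint (NumPy assumed; the toy pose and point are illustrative): building E = [t]_x R for a known relative pose and checking that a pair of corresponding normalized image points satisfies the constraint:

```python
import numpy as np

def skew(t):
    """3x3 cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential(R, t):
    """Essential matrix E = [t]_x R for relative pose (R, t)."""
    return skew(t) @ R

# Camera 1 at the origin; camera 2 translated along x and rotated slightly.
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.0])
E = essential(R, t)

X = np.array([0.3, -0.2, 4.0])   # a world point
x1 = X / X[2]                    # normalized homogeneous coords, camera 1
Xc2 = R @ X + t                  # same point in camera-2 frame
x2 = Xc2 / Xc2[2]
residual = x2 @ E @ x1           # epipolar constraint: should be ~0
```

The residual vanishes for any true correspondence, which is exactly why the constraint can be used to reject bad matches.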
Image Processing and Filtering
Before higher-level analysis, images typically undergo preprocessing to enhance relevant features and suppress noise. An image is a discrete function $I(x, y)$ — or $I(x, y, c)$ for color images with $c$ channels — and the basic operation on images is convolution with a kernel $K$:

$$(I * K)(x, y) = \sum_{i} \sum_{j} I(x - i,\, y - j)\, K(i, j)$$
A Gaussian filter with standard deviation $\sigma$ smooths the image, suppressing noise while blurring edges. The Gaussian is separable — a 2D Gaussian convolution can be decomposed into two successive 1D convolutions, reducing computational cost from $O(k^2)$ to $O(2k)$ per pixel, where $k$ is the kernel width. The convolution theorem relates spatial convolution to multiplication in the frequency domain: $\mathcal{F}\{I * K\} = \mathcal{F}\{I\} \cdot \mathcal{F}\{K\}$, where $\mathcal{F}$ denotes the Fourier transform. This duality — between the spatial and frequency domains — is a recurring theme: low-pass filters blur, high-pass filters sharpen, and band-pass filters select features at particular scales.
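The separability trick can be sketched as follows (NumPy assumed; kernel radius and border handling are illustrative choices, not from any particular library) — two 1D passes produce the same blur as one 2D pass:

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    """Sampled, normalized 1D Gaussian."""
    if radius is None:
        radius = int(3 * sigma)          # common truncation heuristic
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian smoothing: filter rows, then columns."""
    g = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, rows)

# Blurring a unit impulse recovers the (outer-product) 2D Gaussian.
img = np.zeros((21, 21))
img[10, 10] = 1.0
blurred = gaussian_blur(img, sigma=1.5)
```

Because the impulse sits well inside the image, the response sums to one and peaks at the center, as a normalized blur should.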
Edge detection identifies locations where image intensity changes abruptly, corresponding to object boundaries, texture changes, or shadows. The simplest approach computes image gradients using finite difference operators like the Sobel or Prewitt kernels, which estimate partial derivatives $\partial I / \partial x$ and $\partial I / \partial y$. The gradient magnitude $\|\nabla I\| = \sqrt{I_x^2 + I_y^2}$ indicates edge strength, and the gradient direction $\theta = \operatorname{atan2}(I_y, I_x)$ indicates edge orientation. The Canny edge detector, proposed by John Canny in 1986 and still widely used, refines this by applying non-maximum suppression (thinning edges to single-pixel width) and hysteresis thresholding (using two thresholds to link strong and weak edge pixels), producing clean, connected edge maps. Histogram equalization redistributes pixel intensities to span the full dynamic range, improving contrast in images that are too dark or too light, while more sophisticated techniques like CLAHE (Contrast Limited Adaptive Histogram Equalization) apply equalization locally to avoid amplifying noise.
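A minimal Sobel sketch (NumPy assumed; the direct loop and "valid" cross-correlation are for clarity, not speed — real libraries apply filters as cross-correlation too, but vectorized):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def filter2d_valid(image, kernel):
    """Direct 'valid' 2D cross-correlation (no padding)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def sobel_edges(image):
    """Gradient magnitude and orientation from Sobel derivatives."""
    gx = filter2d_valid(image, SOBEL_X)      # d/dx
    gy = filter2d_valid(image, SOBEL_X.T)    # d/dy
    return np.hypot(gx, gy), np.arctan2(gy, gx)

# A vertical step edge: dark left half, bright right half.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
mag, ori = sobel_edges(img)
```

On this step edge the magnitude peaks along the boundary column and the orientation there is 0 radians — the gradient points in the +x direction, perpendicular to the edge.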
Feature Detection, Description, and Matching
Identifying and matching distinctive points across images is a cornerstone of vision, enabling panorama stitching, 3D reconstruction, object recognition, and visual tracking. A good feature detector finds image locations that are stable under changes in viewpoint, scale, and illumination.
The Harris corner detector, introduced by Chris Harris and Mike Stephens in 1988, identifies corners by analyzing the structure tensor (also called the second moment matrix) of image gradients in a local window. For a point $(x, y)$, the structure tensor is:

$$M = \sum_{u, v} w(u, v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$
where $I_x$ and $I_y$ are image derivatives and $w(u, v)$ is a Gaussian weighting function. A corner is a point where both eigenvalues of $M$ are large — meaning the image intensity changes significantly in all directions. Harris and Stephens proposed a corner response function $R = \det(M) - k\,\operatorname{tr}(M)^2$ that avoids explicit eigenvalue computation, where $k$ is an empirical constant typically around 0.04 to 0.06.
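A compact sketch of the response map (NumPy assumed; derivative and smoothing choices are illustrative — real implementations typically use Sobel derivatives and tuned window sizes):

```python
import numpy as np

def harris_response(image, k=0.05, sigma=1.0):
    """Harris response det(M) - k * tr(M)^2 per pixel, with a
    Gaussian-weighted structure tensor M."""
    iy, ix = np.gradient(image)                      # image derivatives
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy
    # Gaussian weighting of the tensor entries (separable blur).
    r = int(3 * sigma)
    x = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    def smooth(a):
        a = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 0, a)
        return np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 1, a)
    sxx, syy, sxy = smooth(ixx), smooth(iyy), smooth(ixy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace**2

# A bright square on a dark background: responses peak at its four corners.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
```

Along the square's edges one eigenvalue dominates, so the response is near zero or negative; only at the corners, where both gradients are strong, does it go clearly positive.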
Harris corners are not scale-invariant. David Lowe’s SIFT (Scale-Invariant Feature Transform), published in 2004, addressed this by detecting features across a scale space built from differences of Gaussians (DoG) — approximations to the Laplacian of Gaussian that efficiently identify blob-like structures at multiple scales. Each detected keypoint is assigned a canonical orientation from the local gradient histogram, and a 128-dimensional descriptor is computed from gradient orientations in a grid of subregions around the keypoint. SIFT descriptors are remarkably invariant to scale, rotation, and moderate viewpoint changes, and they remained the gold standard for over a decade. SURF (Speeded-Up Robust Features), introduced by Herbert Bay and colleagues in 2006, accelerated SIFT-like detection using integral images and Haar wavelet responses. Binary descriptors like BRIEF, ORB, and BRISK replaced floating-point gradient histograms with binary strings, enabling matching via Hamming distance rather than Euclidean distance and achieving order-of-magnitude speedups suitable for real-time applications.
Feature matching pairs descriptors between images using nearest-neighbor search in descriptor space. Lowe’s ratio test discards ambiguous matches by requiring the distance to the nearest neighbor to be significantly smaller than the distance to the second nearest. After matching, RANSAC (Random Sample Consensus), proposed by Martin Fischler and Robert Bolles in 1981, robustly estimates geometric models (homographies, fundamental matrices) from correspondences contaminated by outliers. RANSAC repeatedly samples minimal subsets of matches, fits a model, counts inliers, and keeps the model with the most support — a simple algorithm that has proven indispensable across all of computer vision.
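The RANSAC loop is short enough to show in full; as a hedged sketch, 2D line fitting stands in here for homography or fundamental-matrix estimation (pure Python; the data, tolerance, and iteration count are illustrative):

```python
import random

def ransac_line(points, n_iters=200, inlier_tol=0.1, seed=0):
    """RANSAC for a 2D line y = a*x + b: sample a minimal subset
    (2 points), fit, count inliers, keep the model with most support."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue                       # vertical: skip degenerate sample
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (a * x + b)) < inlier_tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# 20 points exactly on y = 2x + 1, plus 5 gross outliers.
pts = [(x * 0.5, 2 * (x * 0.5) + 1) for x in range(20)]
pts += [(1.0, 9.0), (3.0, -4.0), (5.0, 20.0), (7.0, 0.0), (2.0, 13.0)]
model, inliers = ransac_line(pts)
```

A least-squares fit to all 25 points would be dragged off by the outliers; RANSAC recovers the true line because any minimal sample of two inliers already fits it exactly.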
Geometric Reconstruction and 3D Vision
Recovering three-dimensional structure from two-dimensional images is one of the grand challenges of computer vision. Stereo vision — using two cameras with known relative position — computes depth by triangulation: if a point is observed at pixel $x_l$ in the left image and $x_r$ in the right image, the disparity $d = x_l - x_r$ is inversely proportional to depth, $Z = fB/d$, where $B$ is the baseline (distance between cameras) and $f$ is the focal length. The core difficulty is the correspondence problem — determining which pixel in the left image matches which pixel in the right. Classical stereo algorithms use local windows of pixel intensities, global optimization methods like graph cuts or belief propagation, or semi-global matching (SGM) to produce dense disparity maps.
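The triangulation relation itself is one line of arithmetic; a hedged sketch with toy numbers (the focal length and baseline are illustrative, not from any real rig):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulated depth Z = f * B / d for a rectified stereo pair.
    disparity_px: x_left - x_right in pixels; focal_px: focal length
    in pixels; baseline_m: camera separation in meters."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# f = 700 px, baseline = 0.12 m: a 21-px disparity means 4 m away,
# and halving the depth doubles the disparity.
z_far = depth_from_disparity(21.0, 700.0, 0.12)
z_near = depth_from_disparity(42.0, 700.0, 0.12)
```

The inverse relationship is why stereo depth precision degrades quadratically with distance: a one-pixel matching error shifts distant points far more than nearby ones.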
Structure from Motion (SfM) recovers both camera poses and 3D scene structure from an unordered collection of photographs. The typical pipeline, refined over decades by researchers including Richard Hartley and Andrew Zisserman, proceeds incrementally: start with a two-view reconstruction from the essential matrix, add new images one at a time by solving the Perspective-n-Point (PnP) problem to determine each new camera’s pose, triangulate new 3D points, and periodically refine everything through bundle adjustment — a large-scale nonlinear least-squares optimization that minimizes the reprojection error across all cameras and points simultaneously. Bundle adjustment, often solved with the Levenberg-Marquardt algorithm exploiting the sparse structure of the Jacobian, is the computational backbone of SfM and produces the accurate reconstructions used in mapping, heritage preservation, and visual effects. Modern systems like COLMAP handle thousands of images and millions of 3D points.
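The objective that bundle adjustment minimizes — summed squared reprojection error — can be sketched directly (NumPy assumed; the single-camera setup and toy values are illustrative, and a real solver would optimize over all poses and points jointly):

```python
import numpy as np

def reproject(K, R, t, X):
    """Pinhole projection of 3D points X (N x 3) to pixel coordinates."""
    x = (X @ R.T + t) @ K.T
    return x[:, :2] / x[:, 2:]

def reprojection_error(K, R, t, X, observed):
    """Sum of squared pixel residuals between predicted and observed
    projections — the quantity bundle adjustment drives toward zero."""
    residuals = reproject(K, R, t, X) - observed
    return float((residuals ** 2).sum())

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
X = np.array([[0.0, 0.0, 3.0], [0.5, -0.2, 4.0]])

obs = reproject(K, R, t, X)                       # perfect observations
err_perfect = reprojection_error(K, R, t, X, obs)
err_noisy = reprojection_error(K, R, t, X, obs + 1.0)  # each coord off by 1 px
```

In a full pipeline this scalar is summed over every camera-point pair, and Levenberg-Marquardt exploits the sparsity of the Jacobian (each residual depends on only one camera and one point) to scale to millions of parameters.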
SLAM (Simultaneous Localization and Mapping) extends these ideas to real-time sequential settings, where a camera moves through an unknown environment and must simultaneously track its own pose and build a map. Visual SLAM systems like ORB-SLAM fuse feature tracking, local bundle adjustment, loop closure detection, and global pose graph optimization into unified frameworks that run in real time on consumer hardware. The integration of inertial measurement units (IMUs) with visual tracking — Visual-Inertial Odometry — provides robustness to rapid motion and textureless scenes, enabling applications from augmented reality to drone navigation.
Object Detection and Recognition
Identifying what objects are present in an image and where they are located is the detection problem — one of the most commercially impactful applications of computer vision. The field’s evolution from hand-crafted features to deep learning over the past two decades illustrates the broader transformation of AI.
Classical detection relied on sliding a fixed-size window across the image at multiple scales and classifying each window. Viola and Jones’s face detector (2001) used Haar-like features evaluated efficiently with integral images and a cascade of boosted classifiers, achieving real-time face detection and demonstrating that vision systems could operate on consumer hardware. Histograms of Oriented Gradients (HOG), introduced by Navneet Dalal and Bill Triggs in 2005, encoded local gradient distributions and, combined with support vector machines, became the standard for pedestrian detection. Deformable part models (DPMs), developed by Pedro Felzenszwalb and colleagues, extended HOG by modeling objects as collections of parts with learned spatial relationships, capturing the variability of non-rigid objects.
The deep learning revolution in object detection began with R-CNN (Regions with CNN features), proposed by Ross Girshick and colleagues in 2014. R-CNN extracted region proposals using selective search, classified each region with a convolutional neural network pre-trained on ImageNet, and achieved a dramatic improvement over prior methods on the PASCAL VOC benchmark. Fast R-CNN (2015) eliminated redundant computation by sharing convolutional features across proposals, and Faster R-CNN (2015) replaced selective search with a learned Region Proposal Network (RPN), making the entire pipeline end-to-end trainable and fast enough for practical applications. The anchor box mechanism in Faster R-CNN — predicting offsets relative to predefined bounding boxes of various sizes and aspect ratios — became a standard technique.
Single-stage detectors eliminated the proposal generation step entirely. YOLO (You Only Look Once), introduced by Joseph Redmon and colleagues in 2016, divided the image into a grid and predicted bounding boxes and class probabilities directly from each grid cell in a single forward pass, achieving real-time detection. Successive versions — YOLOv2 through YOLOv8 and beyond — incorporated batch normalization, multi-scale feature maps, anchor-free prediction, and architectural improvements to achieve state-of-the-art accuracy at interactive speeds. SSD (Single Shot MultiBox Detector) adopted a similar philosophy with multi-scale feature maps, and RetinaNet addressed the foreground-background class imbalance in single-stage detectors with focal loss, which down-weights well-classified examples and focuses training on hard negatives. More recently, vision transformers have been adapted for detection, with architectures like DETR (DEtection TRansformer) reformulating detection as a set prediction problem using attention mechanisms and bipartite matching loss, eliminating the need for anchor boxes and non-maximum suppression entirely.
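The greedy non-maximum suppression step that DETR eliminates is simple enough to show in full (pure Python; box format, scores, and threshold are illustrative):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it above iou_thresh, repeat with the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two heavily overlapping detections of the same object, plus one far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The two overlapping boxes collapse to the higher-scoring one while the distant box survives — exactly the duplicate-removal behavior that set-prediction losses build into training instead.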
Semantic and Instance Segmentation
While detection localizes objects with bounding boxes, segmentation classifies every pixel in the image — a finer-grained understanding of visual scenes. Semantic segmentation assigns a class label to each pixel, grouping all pixels belonging to the same category (all road pixels, all sky pixels) without distinguishing individual instances. Instance segmentation goes further, distinguishing individual objects of the same class (this car versus that car).
The modern era of semantic segmentation began with Fully Convolutional Networks (FCNs), introduced by Jonathan Long, Evan Shelhamer, and Trevor Darrell in 2015. By replacing the fully connected layers of a classification network with convolutional layers and adding learned upsampling (transposed convolutions), FCNs produced dense per-pixel predictions while retaining the ability to process images of arbitrary size. Skip connections from earlier layers preserved fine spatial detail that coarse feature maps lost during downsampling.
The encoder-decoder architecture became the dominant paradigm. U-Net, developed by Olaf Ronneberger and colleagues in 2015 for medical image segmentation, paired a contracting path (encoder) with a symmetric expanding path (decoder) connected by skip connections at each resolution level, producing precise segmentation masks even with limited training data. SegNet adopted a similar structure, using pooling indices from the encoder to guide upsampling in the decoder. DeepLab, developed at Google, introduced atrous (dilated) convolutions — convolutions with gaps between kernel elements — to enlarge the receptive field without increasing the number of parameters or reducing resolution. DeepLabv3 combined atrous convolutions at multiple rates with Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context, achieving strong results across multiple benchmarks. Post-processing with conditional random fields (CRFs) refined segmentation boundaries by incorporating low-level image cues like color similarity.
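The atrous idea is easiest to see in one dimension; a hedged sketch (NumPy assumed; the helper is illustrative, not the DeepLab implementation) shows how spacing kernel taps `rate` apart widens the receptive field from k to (k - 1) * rate + 1 without adding parameters:

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """1D dilated (atrous) convolution: kernel taps are spaced `rate`
    apart, enlarging the receptive field without extra weights."""
    k = len(kernel)
    span = (k - 1) * rate + 1          # effective receptive field
    out = []
    for i in range(len(signal) - span + 1):
        out.append(sum(signal[i + j * rate] * kernel[j] for j in range(k)))
    return np.array(out), span

sig = np.arange(10.0)
y_dense, span_dense = atrous_conv1d(sig, [1.0, 1.0, 1.0], rate=1)
y_atrous, span_atrous = atrous_conv1d(sig, [1.0, 1.0, 1.0], rate=3)
```

Both calls use the same three weights, but the dilated version aggregates context over seven samples instead of three — the mechanism ASPP exploits at several rates in parallel.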
Instance segmentation was unified by Mask R-CNN, introduced by Kaiming He and colleagues in 2017. Mask R-CNN extends Faster R-CNN by adding a parallel branch that predicts a binary segmentation mask for each detected region of interest. The key innovation, RoIAlign, replaced the quantized pooling of previous methods with bilinear interpolation, preserving spatial precision. Panoptic segmentation, proposed by Alexander Kirillov and colleagues in 2019, unified semantic and instance segmentation into a single task: every pixel receives both a class label and an instance ID (for “things” like cars and people) or is labeled as “stuff” (sky, road, grass). Panoptic architectures typically combine separate semantic and instance branches with a merging module that resolves conflicts.
Deep Architectures and Representation Learning
The deep learning revolution in vision rests on convolutional neural networks (CNNs), whose architecture — learned local filters applied across the spatial extent of an image — mirrors the hierarchical, translation-invariant processing observed in biological visual cortex. The decisive moment was AlexNet, which won the ImageNet Large Scale Visual Recognition Challenge in 2012 by a large margin, demonstrating that deep CNNs trained on GPUs could dramatically outperform hand-engineered features. VGGNet (2014) showed that depth matters — stacking many convolutional layers outperformed shallower networks with larger kernels. GoogLeNet/Inception (2014) introduced the Inception module, processing the same input with multiple kernel sizes in parallel and concatenating the results. ResNet (2015), by Kaiming He and colleagues, solved the degradation problem — the paradox that deeper networks trained worse — by introducing residual connections (skip connections) that allow gradients to flow directly through the network, enabling training of architectures with hundreds of layers.
Vision Transformers (ViTs), introduced by Alexey Dosovitskiy and colleagues in 2020, challenged the dominance of convolutions by applying the transformer architecture — originally developed for natural language processing — directly to images. An image is divided into fixed-size patches, each linearly embedded into a token, and the sequence of tokens is processed by standard transformer encoder layers with self-attention. Despite lacking the inductive biases of convolutions (locality, translation invariance), ViTs achieved competitive or superior performance when trained on large datasets, suggesting that sufficient data can compensate for architectural priors. Hybrid architectures like Swin Transformer reintroduce locality through windowed attention and hierarchical feature maps, achieving state-of-the-art results with improved efficiency.
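The patchify-and-embed front end of a ViT can be sketched in a few lines (NumPy assumed; the image size, patch size, and embedding dimension are toy values, not those of any published model):

```python
import numpy as np

def patchify(image, patch):
    """Split an H x W x C image into non-overlapping flattened patches,
    giving a (num_patches, patch*patch*C) token matrix."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(image[i:i+patch, j:j+patch, :].reshape(-1))
    return np.stack(tokens)

def embed_patches(image, patch, w_embed):
    """Linear patch embedding: flattened patches times a learned matrix,
    as in the first layer of a ViT (before positional embeddings)."""
    return patchify(image, patch) @ w_embed

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))        # toy 32x32 RGB image
W = rng.standard_normal((16 * 16 * 3, 64))    # toy embedding dimension 64
tokens = embed_patches(img, 16, W)            # 4 tokens of dimension 64
```

Everything after this step is a standard transformer encoder operating on the token sequence, which is why the architecture transferred from language so directly.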
Transfer learning has been essential to practical vision systems. Pre-training a deep network on a large dataset (ImageNet, with its 1.2 million labeled images across 1000 classes) and fine-tuning it on a smaller target dataset yields dramatically better results than training from scratch, because early layers learn general features — edges, textures, color patterns — that transfer across tasks. Self-supervised methods like contrastive learning (SimCLR, MoCo) and masked image modeling (MAE) learn visual representations without labeled data by solving pretext tasks, and foundation models like CLIP (Contrastive Language-Image Pre-training) learn joint visual-textual representations from hundreds of millions of image-text pairs, enabling zero-shot classification and forming the backbone of modern text-to-image generation systems.
Generative Models, Video Understanding, and Frontiers
Generative models have expanded computer vision from analysis to synthesis. Generative Adversarial Networks (GANs), introduced by Ian Goodfellow and colleagues in 2014, train a generator and a discriminator in a minimax game: the generator learns to produce images indistinguishable from real data, while the discriminator learns to tell them apart. Progressive training, style-based architectures (StyleGAN), and conditional variants (pix2pix, CycleGAN) produced increasingly photorealistic and controllable image synthesis. Variational Autoencoders (VAEs) frame generation as probabilistic inference, encoding images into a latent distribution and decoding samples. Diffusion models, which learn to reverse a gradual noising process, have surpassed GANs in image quality and diversity. Systems like DALL-E, Stable Diffusion, and Midjourney use diffusion models conditioned on text prompts to generate high-fidelity images from natural language descriptions, transforming creative workflows and raising profound questions about authenticity and intellectual property.
Video understanding extends vision from single frames to temporal sequences. Optical flow — the apparent motion of pixels between frames — is estimated by algorithms ranging from the classical Lucas-Kanade and Horn-Schunck methods to modern deep learning approaches like FlowNet and RAFT. Action recognition classifies human activities in video: two-stream networks process appearance (RGB) and motion (optical flow) in parallel; 3D CNNs like C3D and I3D apply convolutions along the temporal dimension; and video transformers attend to spatiotemporal tokens. Temporal action detection localizes the start and end of actions within untrimmed video, while video object segmentation tracks and segments specific objects across frames, maintaining identity through occlusions and appearance changes.
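At its core, the classical Lucas-Kanade method reduces to one small least-squares solve per window; a hedged sketch (NumPy assumed; the synthetic blob, window size, and single-window setup are illustrative):

```python
import numpy as np

def lucas_kanade(prev, curr, center, radius=4):
    """Single-window Lucas-Kanade flow: solve the least-squares system
    with rows [Ix, Iy] and right-hand side -It over a local window."""
    iy, ix = np.gradient(prev)          # spatial derivatives
    it = curr - prev                    # temporal derivative
    r, c = center
    win = (slice(r - radius, r + radius + 1),
           slice(c - radius, c + radius + 1))
    A = np.stack([ix[win].ravel(), iy[win].ravel()], axis=1)
    b = -it[win].ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v                            # (vx, vy) in pixels per frame

# A smooth blob shifted one pixel to the right between frames.
xs, ys = np.meshgrid(np.arange(30, dtype=float), np.arange(30, dtype=float))
prev = np.exp(-((xs - 14)**2 + (ys - 15)**2) / 20)
curr = np.exp(-((xs - 15)**2 + (ys - 15)**2) / 20)
flow = lucas_kanade(prev, curr, center=(15, 15))
```

The linearization assumes small motion and smooth intensity, which is why practical trackers run Lucas-Kanade in coarse-to-fine image pyramids.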
Neural Radiance Fields (NeRFs), introduced by Ben Mildenhall and colleagues in 2020, represent a scene as a continuous volumetric function mapping spatial position and viewing direction to color and density, parameterized by a neural network. Novel views are synthesized by volume rendering — casting rays through the scene, sampling points along each ray, querying the network, and compositing colors and opacities. NeRFs produce photorealistic novel views from sparse input images and have spawned an enormous research ecosystem addressing speed (instant NGP, 3D Gaussian splatting), dynamic scenes, relighting, and editing. The broader field of 3D vision continues to advance with point cloud processing networks like PointNet and PointNet++, which consume raw 3D coordinates and learn per-point and global features for classification, segmentation, and detection in three dimensions.
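The compositing step at the heart of NeRF rendering can be sketched for a single ray (NumPy assumed; the densities, colors, and sample spacing are toy values, and the network query is replaced by fixed samples):

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """NeRF-style volume rendering along one ray:
    alpha_i = 1 - exp(-sigma_i * delta_i),
    T_i = prod_{j<i} (1 - alpha_j),  C = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    color = (weights[:, None] * colors).sum(axis=0)
    return color, weights

# Three samples: empty space, then a dense red sample that occludes
# the green sample behind it.
densities = np.array([0.0, 50.0, 50.0])
colors = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
deltas = np.array([0.1, 0.1, 0.1])
color, weights = composite_ray(densities, colors, deltas)
```

Because every operation here is differentiable, the photometric loss between rendered and observed pixels can be backpropagated straight into the network that produces the densities and colors.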
At the frontier, multimodal foundation models like CLIP, GPT-4V, and Gemini integrate visual perception with language understanding, enabling visual question answering, image captioning, and reasoning about visual content through natural language. Embodied vision connects perception to action in robotic systems, where a visual policy must guide manipulation, navigation, and interaction in the physical world. Self-supervised learning at scale promises to learn general visual representations from the vast supply of unlabeled images and video, potentially surpassing the supervised paradigm that has dominated the field since AlexNet. These developments point toward vision systems that are not merely classifiers or detectors but genuine visual reasoners — machines that understand the physical world through seeing.