Robotics

The design and control of physical agents — perception, planning, manipulation, locomotion, and human-robot interaction.


Robotics is the science and engineering of designing, building, and controlling machines that interact with the physical world through sensing, computation, and actuation. Unlike purely software-based AI, a robot must contend with gravity, friction, sensor noise, actuator limits, and the unforgiving demand for real-time response. The field draws on mechanics, control theory, computer vision, and machine learning, and it spans applications from industrial assembly lines and surgical suites to autonomous vehicles and planetary exploration.

Kinematics and Coordinate Transformations

The geometry of robot motion begins with kinematics — the study of motion without regard to the forces that cause it. A robot manipulator is typically modeled as a chain of rigid links connected by joints, each of which adds one degree of freedom (DOF). A revolute joint allows rotation; a prismatic joint allows translation. The configuration of the entire robot is described by its joint variables — the angles or displacements at each joint — which together form the configuration space or joint space.

Forward kinematics computes the position and orientation of the robot’s end-effector (the tool, gripper, or hand at the tip of the arm) given the joint variables. The standard method uses the Denavit-Hartenberg (DH) convention, introduced by Jacques Denavit and Richard Hartenberg in 1955, which assigns a coordinate frame to each link and parameterizes the transformation between consecutive frames using four numbers: the link length aa, the link twist α\alpha, the joint offset dd, and the joint angle θ\theta. The forward kinematic function is then the product of these transformation matrices:

T0n=i=1nAi(θi)T_0^n = \prod_{i=1}^{n} A_i(\theta_i)

where each AiA_i is a 4×44 \times 4 homogeneous transformation matrix encoding both rotation and translation. Rotation matrices (elements of the special orthogonal group SO(3)SO(3)) describe orientation, while the full transformation lives in SE(3)SE(3), the special Euclidean group. Alternative orientation representations include Euler angles (intuitive but subject to gimbal lock — a singularity where two axes align and one degree of freedom is lost) and quaternions (a four-dimensional representation that avoids gimbal lock and interpolates smoothly).

Inverse kinematics (IK) solves the opposite problem: given a desired end-effector pose, find the joint variables that achieve it. IK is generally much harder than forward kinematics. For simple manipulators (e.g., a 6-DOF arm with a spherical wrist), closed-form geometric or algebraic solutions exist. For more complex or redundant robots, numerical methods based on the Jacobian matrix are used. The Jacobian J(θ)J(\theta) relates joint velocities to end-effector velocities: x˙=J(θ)θ˙\dot{x} = J(\theta)\dot{\theta}. Inverting this relationship (via the pseudoinverse or the damped least-squares method to handle singularities) provides an iterative approach to solving IK. Singularities — configurations where the Jacobian loses rank — are fundamental obstacles: at a singularity, the robot loses the ability to move in certain directions, and joint velocities may become unbounded for small task-space motions. The manipulability measure det(JJT)\sqrt{\det(JJ^T)} quantifies how far the robot is from a singularity and how uniformly it can move in all directions.

Dynamics and Control

Dynamics adds forces and torques to the kinematic picture, relating joint accelerations to the torques applied by the actuators. The equations of motion for a manipulator take the general form:

M(θ)θ¨+C(θ,θ˙)θ˙+g(θ)=τM(\theta)\ddot{\theta} + C(\theta, \dot{\theta})\dot{\theta} + g(\theta) = \tau

where M(θ)M(\theta) is the inertia matrix, C(θ,θ˙)C(\theta, \dot{\theta}) captures Coriolis and centrifugal effects, g(θ)g(\theta) is the gravity vector, and τ\tau is the vector of joint torques. Two classical formulations derive these equations: the Newton-Euler recursive method, which propagates velocities and accelerations outward from the base and forces inward from the tip, and the Lagrangian formulation, which derives the equations from the difference of kinetic and potential energy. The Newton-Euler method is more computationally efficient (O(n)O(n) in the number of joints) and is preferred for real-time control.

PID control (Proportional-Integral-Derivative) is the most widely used control law in robotics and industry. The control signal is the sum of three terms: a proportional term that responds to the current error, an integral term that accumulates past error to eliminate steady-state offset, and a derivative term that anticipates future error based on the rate of change. PID is simple and effective for linear or mildly nonlinear systems, but it does not account for the coupling and nonlinearity of multi-joint robot dynamics.

Computed torque control (also called inverse dynamics control) uses the full dynamic model to cancel nonlinear terms, reducing the system to a set of decoupled linear systems that can be controlled independently. The control law τ=M(θ)θ¨d+C(θ,θ˙)θ˙+g(θ)+Kpe+Kde˙\tau = M(\theta)\ddot{\theta}_d + C(\theta, \dot{\theta})\dot{\theta} + g(\theta) + K_p e + K_d \dot{e} linearizes the closed-loop dynamics, where e=θdθe = \theta_d - \theta is the tracking error and KpK_p, KdK_d are gain matrices. This approach achieves high performance but requires an accurate dynamic model.

Impedance control, introduced by Neville Hogan in 1985, shifts the paradigm from controlling position to controlling the relationship between force and motion. Rather than commanding the robot to follow a trajectory exactly, impedance control makes the robot behave as if it were a mass-spring-damper system interacting with the environment. This is essential for tasks involving contact — assembly, polishing, human-robot collaboration — where rigid position control would generate dangerous forces. Model predictive control (MPC) optimizes a sequence of future control actions by simulating the system forward over a finite horizon, subject to constraints on states and inputs, and is increasingly used for complex robots like legged machines and drones.

Sensors, Actuators, and Sensor Fusion

A robot’s ability to perceive and act depends on its sensors and actuators. Proprioceptive sensors measure the robot’s own state: encoders measure joint angles, tachometers measure joint velocities, accelerometers and gyroscopes (combined in an inertial measurement unit, or IMU) measure linear acceleration and angular velocity. Exteroceptive sensors perceive the external environment: cameras provide rich visual information, LiDAR (Light Detection and Ranging) produces dense 3D point clouds, radar measures distance and velocity using radio waves, ultrasonic sensors measure proximity, and force/torque sensors at the wrist or fingertips measure contact forces.

Actuators convert electrical, hydraulic, or pneumatic energy into mechanical motion. DC motors (brushed and brushless), stepper motors, and servo motors are the most common for robotic manipulators. Hydraulic actuators provide high force density and are used in heavy-duty industrial robots and construction equipment. Series elastic actuators (SEAs), which incorporate a compliant element between the motor and the load, provide inherent force sensing and safer interaction with humans. Pneumatic actuators and shape-memory alloys are used in specialized applications including soft robotics.

Sensor fusion combines data from multiple sensors to produce estimates that are more accurate, more complete, and more robust than any single sensor alone. The Kalman filter, developed by Rudolf Kalman in 1960, is the optimal estimator for linear systems with Gaussian noise, recursively updating a state estimate and its uncertainty covariance as new measurements arrive. The extended Kalman filter (EKF) linearizes nonlinear models around the current estimate, making it applicable to robot localization and navigation. Particle filters (sequential Monte Carlo methods) represent the posterior distribution over states as a set of weighted samples, handling arbitrary nonlinearities and multi-modal distributions at the cost of computational expense proportional to the number of particles. Multi-sensor fusion architectures typically combine IMU data (high frequency, accumulates drift) with camera or LiDAR data (lower frequency, drift-free) using one of these frameworks.

A mobile robot must navigate from its current location to a goal while avoiding obstacles. This requires solving two interrelated problems: localization (where am I?) and path planning (how do I get there?).

Path planning algorithms find collision-free paths through the robot’s configuration space. Grid-based methods discretize the space and apply graph search (A*, Dijkstra). Sampling-based methods are more scalable to high-dimensional spaces. The Rapidly-exploring Random Tree (RRT), introduced by Steven LaValle in 1998, grows a tree from the start configuration by repeatedly sampling random points and extending the nearest tree node toward them. RRT is probabilistically complete — it will find a path if one exists, given enough time — but the paths it produces are typically suboptimal. RRT* (Sertac Karaman and Emilio Frazzoli, 2011) adds a rewiring step that converges asymptotically to the optimal path. Probabilistic Roadmaps (PRM) build a graph of randomly sampled configurations connected by collision-free edges, and then search the graph; they are especially effective when many queries must be answered in the same environment.

Obstacle avoidance during execution uses reactive methods that complement the global planner. Potential fields, introduced by Oussama Khatib in 1986, model the goal as an attractive force and obstacles as repulsive forces, guiding the robot along the negative gradient of the combined potential. The method is simple and fast but can get stuck in local minima — configurations where the attractive and repulsive forces balance without the robot having reached the goal. The Dynamic Window Approach (DWA) searches over the space of velocities reachable within one control cycle, evaluating each candidate velocity for progress toward the goal, obstacle clearance, and speed, and selecting the best. Behavior-based architectures, pioneered by Rodney Brooks in the late 1980s, layer simple reactive behaviors (avoid obstacles, follow walls, seek goals) and let their interaction produce complex navigation without explicit planning.

SLAM: Simultaneous Localization and Mapping

The SLAM problem — building a map of an unknown environment while simultaneously localizing within it — is one of the defining challenges of mobile robotics. The difficulty is circular: accurate localization requires a good map, but building a good map requires accurate localization.

EKF-SLAM treats the robot pose and all landmark positions as a single state vector and estimates it with an extended Kalman filter. As new landmarks are observed, the state vector grows, and the full covariance matrix (which encodes correlations between the robot pose and every landmark) is updated. EKF-SLAM is elegant but scales poorly: maintaining the covariance matrix is O(n2)O(n^2) in the number of landmarks, making it impractical for large environments.

Graph-based SLAM (also called GraphSLAM) represents the problem as a pose graph where nodes are robot poses and edges encode constraints derived from odometry and observations. Loop closure detection — recognizing that the robot has returned to a previously visited location — adds constraints that correct accumulated drift. The optimal map is found by minimizing the total error across all constraints, a large nonlinear least-squares problem typically solved with iterative methods like Gauss-Newton or Levenberg-Marquardt.

Visual SLAM uses cameras as the primary sensor. ORB-SLAM (Raul Mur-Artal and colleagues, 2015) is a widely used feature-based system that extracts ORB features, tracks them across frames, estimates the camera trajectory, and builds a sparse 3D map. LSD-SLAM and DSO (Direct Sparse Odometry) operate directly on pixel intensities rather than extracted features, achieving dense or semi-dense reconstructions. More recently, deep learning has entered SLAM: neural networks estimate depth from monocular images, predict optical flow, and even learn entire SLAM pipelines end-to-end. LiDAR SLAM systems produce dense, metrically accurate 3D maps and are the standard for autonomous vehicles and large-scale mapping.

Manipulation and Grasping

Manipulation — the ability to interact with and change the state of objects — is what distinguishes a robot from a mere sensor platform. Grasping is the gateway problem: before a robot can assemble, sort, or use objects, it must be able to pick them up reliably.

A grasp is characterized by the contact points between the gripper and the object and the forces applied at those points. A grasp achieves force closure if the contacts can resist any external wrench (force and torque) applied to the object, ensuring a firm hold regardless of disturbances. Form closure is the stronger geometric condition that the object cannot move at all, even without friction. Grasp quality metrics quantify how robustly a grasp can resist disturbances — common measures include the radius of the largest ball inscribed in the set of achievable wrenches.

Analytic grasp planning computes grasps from geometric models of the object and the gripper, optimizing for force closure and quality metrics. Data-driven grasp planning uses machine learning to predict grasp success from visual input, typically RGB-D images. Networks like GraspNet and Dex-Net (developed by Ken Goldberg’s group at Berkeley) train on millions of simulated grasps and transfer the learned policies to real robots. Dexterous manipulation — controlling objects with multi-fingered hands through in-hand manipulation, pivoting, and regrasping — remains one of the field’s hardest open problems, requiring simultaneous reasoning about contact dynamics, friction, and fine motor control.

Task and motion planning (TAMP) integrates high-level symbolic planning (which objects to grasp, in what order, where to place them) with low-level motion planning (collision-free arm trajectories, stable grasps). TAMP is essential for complex manipulation tasks like cooking, assembly, and tidying, where the combinatorial space of possible action sequences interacts with the continuous space of robot motions.

Human-Robot Interaction and Emerging Frontiers

As robots move out of factories and into homes, hospitals, and public spaces, human-robot interaction (HRI) becomes critical. A robot that works alongside humans must be safe (ISO standards limit forces and speeds for collaborative robots, or “cobots”), predictable (its behavior should be legible and anticipatable), and communicative (it should signal its intentions through motion, gaze, speech, or gesture).

Shared autonomy blends human and robot control: the human provides high-level intent while the robot handles low-level execution, or the robot acts autonomously but defers to the human when uncertain. Learning from demonstration (LfD) allows non-expert users to teach robots new tasks by showing them what to do, rather than programming explicit instructions. Behavioral cloning directly maps observed states to actions; inverse reinforcement learning infers the demonstrator’s reward function and then optimizes a policy to maximize it. Preference-based learning refines robot behavior by asking humans to compare candidate behaviors rather than provide demonstrations, and it underlies the RLHF methods used in language model alignment.

Social robotics addresses robots designed for direct social interaction — assistive robots for the elderly, educational robots for children, and service robots in hospitality. These robots must recognize and respond to human emotions, obey social norms, and adapt their behavior to individual users. The design of a social robot’s appearance, voice, and gestures profoundly affects how humans perceive and interact with it — the uncanny valley hypothesis, proposed by Masahiro Mori in 1970, predicts that robots that are almost but not quite human-like elicit a feeling of unease.

Swarm robotics takes inspiration from biological collectives — ant colonies, bird flocks, fish schools — to coordinate large numbers of simple robots through local interactions and decentralized control. Emergent collective behaviors like flocking, foraging, and formation control arise from simple rules without any central coordinator, providing robustness and scalability. Soft robotics builds robots from compliant materials — silicone, rubber, fabric — that deform continuously rather than relying on rigid links and joints. Soft robots are inherently safe for human interaction, can squeeze through narrow spaces, and can conform to irregularly shaped objects, but their infinite-dimensional deformations make modeling and control far more challenging than for rigid robots.

Autonomous vehicles represent perhaps the most ambitious application of robotics, requiring the integration of perception (cameras, LiDAR, radar), localization (GPS, SLAM), prediction (forecasting the behavior of other road users), planning (behavior planning and trajectory optimization), and control — all operating in real time in an open, dynamic, and safety-critical environment. Learning-based robotics more broadly is transforming the field: reinforcement learning trains legged robots to traverse rough terrain, sim-to-real transfer bridges the gap between simulated and physical environments, and foundation models for robotics promise general-purpose manipulation skills learned from internet-scale data. The convergence of advances in hardware, sensing, computation, and AI is bringing the decades-old vision of truly autonomous, intelligent machines closer to reality than at any point in the history of the field.