Data Structures

The organization and storage of data for efficient access and modification — arrays, trees, graphs, hash tables, and beyond.


A data structure is a systematic way of organizing, storing, and accessing data so that specific operations can be performed efficiently. The choice of data structure is often the single most important decision in algorithm design: the difference between a linear scan and a logarithmic lookup, between an intractable problem and a practical one, frequently comes down to how the data is arranged. The study of data structures is, at its core, the study of the tradeoffs between time, space, and the complexity of operations.

Fundamental Sequential Structures

The simplest data structures store elements in a linear sequence. An array is a contiguous block of memory in which each element is accessed by its index in O(1)O(1) time. This constant-time access arises from the arithmetic of memory addresses: if the array starts at address aa and each element occupies ss bytes, the element at index ii lives at address a+isa + i \cdot s. Static arrays have a fixed size determined at creation; dynamic arrays (often called vectors) grow as needed by allocating a larger block and copying the existing elements. The standard strategy doubles the capacity when the array is full, so that although an individual insertion may cost O(n)O(n) to copy, the amortized cost per insertion is O(1)O(1) — a classic application of amortized analysis. Arrays enjoy excellent cache locality: because elements are stored contiguously, accessing one element often brings nearby elements into the processor cache, making sequential scans extremely fast on modern hardware.

A linked list stores elements in nodes that are scattered throughout memory, each node containing a value and a pointer to the next node. Singly linked lists support efficient insertion and deletion at the head in O(1)O(1) time, but accessing the kk-th element requires traversing kk nodes, costing O(k)O(k). Doubly linked lists add a pointer to the previous node, enabling O(1)O(1) deletion of a node given a reference to it and efficient traversal in both directions. Circular linked lists connect the last node back to the first, useful for round-robin scheduling and buffer management. The use of a sentinel node (a dummy node at the head or tail) simplifies boundary conditions by eliminating special cases for empty lists and endpoints.

Stacks and queues are abstract data types defined by their access discipline. A stack follows the last-in, first-out (LIFO) principle: elements are added and removed from the same end. Stacks are implemented efficiently with either arrays or linked lists and are fundamental to expression evaluation (converting infix to postfix notation), function call management (the call stack), backtracking algorithms, and the iterative simulation of recursion via explicit stack-based depth-first search. A queue follows the first-in, first-out (FIFO) principle. Array-based circular queues wrap around the end of the array to reuse freed space, maintaining O(1)O(1) enqueue and dequeue operations. Deques (double-ended queues) support insertion and removal at both ends. Queues drive breadth-first search, event-driven simulation, and the scheduling of processes in operating systems.

Trees and Balanced Search Trees

Trees generalize linear structures into hierarchies. A binary search tree (BST) stores keys so that for every node, all keys in its left subtree are smaller and all keys in its right subtree are larger. This invariant enables search, insertion, and deletion in O(h)O(h) time, where hh is the height of the tree. In the best case, a balanced BST has height O(logn)O(\log n); in the worst case (a degenerate, linear chain), the height is O(n)O(n). The challenge of maintaining balance has inspired some of the most elegant data structures in computer science.

AVL trees, invented by Georgy Adelson-Velsky and Evgenii Landis in 1962, were the first self-balancing BSTs. An AVL tree maintains the invariant that the heights of the two child subtrees of every node differ by at most one. When an insertion or deletion violates this invariant, the tree is rebalanced using rotations — local restructuring operations that preserve the BST property while restoring balance. There are four cases (left-left, left-right, right-left, right-right), each handled by one or two rotations. All operations run in O(logn)O(\log n) worst-case time.

Red-black trees, introduced by Rudolf Bayer in 1972 and refined by Leonidas Guibas and Robert Sedgewick in 1978, relax the balance condition by coloring each node red or black and enforcing five properties: the root is black, every leaf (null pointer) is black, the children of a red node are both black, every path from the root to a leaf contains the same number of black nodes, and new insertions are red. These properties ensure that the longest path is at most twice the longest, guaranteeing O(logn)O(\log n) operations. Red-black trees are the workhorse balanced BST in practice, underlying the standard library map and set implementations in many programming languages.

B-trees, introduced by Rudolf Bayer and Edward McCreight in 1970, generalize BSTs to nodes with many children, specifically designed for storage systems where reading a block of data is much cheaper than seeking to a new location. A B-tree of order mm allows each node to have between m/2\lceil m/2 \rceil and mm children, keeping the tree shallow and minimizing disk accesses. B+ trees store all data in the leaves and link the leaves in a sorted chain, supporting efficient range queries. B-trees and their variants are the dominant indexing structure in databases and file systems, from SQLite to modern distributed storage engines.

Hash Tables and Hashing

A hash table provides expected O(1)O(1) time for insertion, deletion, and lookup by using a hash function hh to map keys to positions (called buckets or slots) in an array. The design of the hash function is critical: it should distribute keys uniformly across the array to minimize collisions (distinct keys mapping to the same position). The birthday paradox from probability theory tells us that collisions are inevitable once the number of keys exceeds roughly m\sqrt{m}, where mm is the number of buckets, so every hash table must have a collision resolution strategy.

Separate chaining stores all keys that hash to the same bucket in a linked list (or another secondary structure) at that position. The expected time per operation is O(1+α)O(1 + \alpha), where α=n/m\alpha = n/m is the load factor — the ratio of keys to buckets. When α\alpha exceeds a threshold (typically around 0.75), the table is resized and all keys are rehashed. Open addressing stores all keys directly in the array, probing alternative positions when a collision occurs. Linear probing checks consecutive positions, suffering from primary clustering (long runs of occupied slots) but benefiting from excellent cache performance. Quadratic probing and double hashing reduce clustering at the cost of slightly worse locality. Robin Hood hashing, proposed by Pedro Celis in 1986, equalizes probe lengths by displacing keys with shorter probe distances, improving worst-case lookup times.

Universal hashing, introduced by Larry Carter and Mark Wegman in 1979, addresses the concern that an adversary might choose inputs that cause many collisions for a fixed hash function. A universal hash family is a collection of hash functions such that for any two distinct keys, the probability that a randomly chosen function from the family maps them to the same bucket is at most 1/m1/m. This guarantee suffices for expected O(1)O(1) operations regardless of the input. Cuckoo hashing, invented by Rasmus Pagh and Flemming Friche Rodler in 2001, uses two hash functions and two tables, achieving worst-case O(1)O(1) lookup time (though insertion may require rehashing in rare cases). Perfect hashing constructs a hash function with no collisions for a known set of keys, enabling guaranteed O(1)O(1) lookup; it is used in compilers for keyword recognition and in read-only dictionaries.

Priority Queues and Heaps

A priority queue is an abstract data type that supports inserting elements with associated priorities and extracting the element with the highest (or lowest) priority. The standard implementation is the binary heap, a complete binary tree satisfying the heap property: every node’s key is at most (for a min-heap) the keys of its children. A binary heap can be stored compactly in an array, with the children of the node at index ii at indices 2i+12i + 1 and 2i+22i + 2. Insertion and extraction of the minimum both run in O(logn)O(\log n) time, and building a heap from nn unordered elements takes only O(n)O(n) time using the bottom-up heapify procedure — a fact that surprises many newcomers, since the naive analysis suggests O(nlogn)O(n \log n). Heap sort uses a heap to sort nn elements in O(nlogn)O(n \log n) time with O(1)O(1) auxiliary space, making it an in-place, asymptotically optimal comparison-based sort.

Fibonacci heaps, introduced by Michael Fredman and Robert Tarjan in 1987, support decrease-key in O(1)O(1) amortized time, a property that is critical for the best-known implementations of Dijkstra’s algorithm (O(m+nlogn)O(m + n \log n)) and Prim’s algorithm. The amortized analysis of Fibonacci heaps uses a potential function based on the number of trees and the number of marked nodes in the heap. While Fibonacci heaps are theoretically superior, their constant factors and complexity of implementation limit their practical advantage; pairing heaps, introduced by Fredman, Sedgewick, Sleator, and Tarjan in 1986, offer a simpler alternative with conjectured (but not fully proved) similar amortized bounds. Binomial heaps support merging two heaps in O(logn)O(\log n) time and serve as a stepping stone to understanding Fibonacci heaps. D-ary heaps generalize binary heaps by allowing each node to have dd children, reducing the tree height to O(logdn)O(\log_d n) at the cost of more comparisons per level; choosing dd to match the cache line size can improve practical performance significantly.

String and Spatial Data Structures

Strings — sequences of characters — are ubiquitous in computing, and specialized data structures enable efficient string processing. A trie (from “retrieval,” coined by Edward Fredkin in 1960) is a tree in which each edge is labeled with a character and each node represents a prefix of the stored strings. Searching for a string of length kk takes O(k)O(k) time, independent of the number of strings stored. The space overhead of tries can be large when many nodes have only one child; Patricia tries (practical algorithm to retrieve information coded in alphanumeric), introduced by Donald Morrison in 1968, compress such chains into single edges. Suffix trees, constructed by Peter Weiner in 1973 and made practical by Esko Ukkonen’s 1995 linear-time algorithm, store all suffixes of a string and support powerful operations: finding any substring in O(k)O(k) time, computing the longest common substring of two strings, and answering various pattern-matching queries. Suffix arrays, introduced by Udi Manber and Gene Myers in 1990, provide similar functionality in less space by storing only the sorted order of suffixes.

For multidimensional data, spatial data structures organize points and regions in two or more dimensions. KD-trees (short for kk-dimensional trees), introduced by Jon Bentley in 1975, recursively partition space by alternating among coordinate axes, supporting nearest-neighbor queries in expected O(logn)O(\log n) time for random point distributions. Quadtrees subdivide two-dimensional space into four quadrants recursively, and octrees do the same in three dimensions, finding extensive use in computer graphics and geographic information systems. R-trees, designed by Antonin Guttman in 1984, group nearby objects into bounding rectangles hierarchically, serving as the standard indexing structure for spatial databases. Segment trees support range queries and point updates on one-dimensional intervals in O(logn)O(\log n) time; with lazy propagation, they handle range updates as well. Fenwick trees (binary indexed trees), invented by Peter Fenwick in 1994, provide a simpler and more cache-friendly alternative for prefix-sum queries and point updates.

Probabilistic and Approximate Data Structures

Some of the most impactful modern data structures deliberately trade exactness for dramatic savings in space or time. A skip list, invented by William Pugh in 1989, is a randomized alternative to balanced trees. It consists of multiple levels of sorted linked lists, where each element is promoted to the next level with probability 1/21/2. Search, insertion, and deletion all take expected O(logn)O(\log n) time with high probability. Skip lists are simpler to implement than red-black trees and are the basis of the ordered data structures in several high-performance systems.

A Bloom filter, introduced by Burton Howard Bloom in 1970, tests membership in a set using a bit array and kk independent hash functions. An element is “present” if all kk corresponding bits are set. False positives are possible (the filter may report an element present when it is not), but false negatives are impossible (if the filter says an element is absent, it truly is). With an optimal choice of k(m/n)ln2k \approx (m/n) \ln 2, where mm is the number of bits and nn is the number of elements, the false positive rate is approximately (1ekn/m)k(1 - e^{-kn/m})^k. Bloom filters are used in database engines to avoid unnecessary disk lookups, in network routers for packet classification, and in distributed systems for set reconciliation. Counting Bloom filters extend the basic design by replacing each bit with a counter, allowing deletions at the cost of additional space.

Probabilistic sketches estimate aggregate statistics over data streams. The Count-Min sketch, introduced by Graham Cormode and S. Muthukrishnan in 2005, estimates the frequency of any element in a stream using a two-dimensional array of counters and multiple hash functions, guaranteeing that estimates are never too low and are too high by at most ϵN\epsilon N (where NN is the stream length) with probability 1δ1 - \delta, using O((1/ϵ)log(1/δ))O((1/\epsilon) \log(1/\delta)) space. The HyperLogLog algorithm, developed by Philippe Flajolet, Eric Fusy, Olivier Gandouet, and Frederic Meunier in 2007, estimates the number of distinct elements in a stream using only O(loglogn)O(\log \log n) bits per register, achieving relative error of roughly 1.04/m1.04 / \sqrt{m} with mm registers. Locality-sensitive hashing (LSH) maps similar items to the same hash bucket with high probability, enabling approximate nearest-neighbor search in high-dimensional spaces — a technique critical to recommendation systems, image retrieval, and large-scale similarity search.

Persistent, Concurrent, and Cache-Aware Structures

Classical data structures assume a single user modifying a single mutable copy. Modern computing demands more. A persistent data structure preserves all previous versions of itself as it is modified, so that any past version can be queried at any time. Path copying achieves persistence for tree-based structures by copying only the nodes on the path from the root to the modified node, sharing the rest with the previous version. For balanced trees with height O(logn)O(\log n), this costs O(logn)O(\log n) extra space per update. Persistent segment trees enable powerful techniques in competitive programming and database versioning. Persistent data structures are natural in functional programming languages, where immutability is the default, and they support undo/redo operations, version control, and bitemporal queries in database systems.

Concurrent data structures are designed for safe and efficient access by multiple threads simultaneously. Lock-based approaches use mutual exclusion to protect shared state, but locks can cause contention, deadlock, and priority inversion. Lock-free data structures use atomic hardware primitives like compare-and-swap (CAS) to achieve progress guarantees without locks. The Michael-Scott queue (1996) is a classic lock-free FIFO queue. Lock-free skip lists provide concurrent ordered maps. Concurrent hash tables use techniques like striped locking (locking only a portion of the table) or lock-free insertion with CAS. The theoretical framework of linearizability, introduced by Maurice Herlihy and Jeannette Wing in 1990, provides the correctness criterion: every concurrent execution must appear equivalent to some sequential execution.

Cache-oblivious data structures, a concept introduced by Matteo Frigo, Charles Leiserson, Harald Prokop, and Sridhar Ramachandran in 1999, are designed to perform well on any memory hierarchy without knowing the cache parameters (cache size and line size). The van Emde Boas layout for search trees arranges nodes in memory so that any subtree of height hh fits in roughly 2h2^h contiguous locations, ensuring O(logBn)O(\log_B n) cache misses per search, where BB is the cache line size. Succinct data structures push space efficiency to the information-theoretic limit, representing data in essentially the minimum number of bits while still supporting efficient queries. The rank and select operations on bit vectors, which answer “how many 1-bits are in the first ii positions?” and “where is the jj-th 1-bit?”, form the foundation for succinct representations of trees, graphs, and text indexes. These advanced structures reflect the ongoing evolution of data structure design in response to the realities of modern hardware — deep memory hierarchies, massive parallelism, and the ever-growing scale of data.