Operating Systems

The software that manages hardware resources and provides services to programs — processes, memory, file systems, and scheduling.


An operating system is the layer of software that sits between application programs and the bare hardware, managing resources, enforcing protection, and providing the abstractions that make modern computing possible. Without an operating system, every program would need to manage its own memory, communicate directly with every peripheral device, and coordinate with every other running program — a situation that was tolerable on the single-user batch machines of the 1950s but wholly unworkable for the multitasking, multi-user, networked systems of today. The operating system is, in a precise sense, the software that turns a computer into a useful tool.

Fundamentals and Kernel Architecture

The central role of an operating system is resource management: it allocates the processor, memory, storage, and I/O devices among competing programs, ensuring fair and efficient use. To do this safely, the hardware provides at least two privilege levels: kernel mode (where the OS runs, with full access to hardware) and user mode (where applications run, with restricted access). System calls provide the controlled gateway between user mode and kernel mode, allowing applications to request OS services without compromising system integrity.

The internal structure of operating system kernels has been debated since the field’s inception. Monolithic kernels, exemplified by Linux and traditional Unix, place all OS services — file systems, device drivers, networking, scheduling — in a single large program running in kernel mode. This approach offers excellent performance (no overhead for crossing protection boundaries between subsystems) but makes the kernel complex and difficult to verify. Microkernels, pioneered by Andrew Tanenbaum’s MINIX and formalized in the Mach project at Carnegie Mellon, move most services into user-space processes, leaving only minimal functionality — address space management, inter-process communication, and basic scheduling — in the kernel itself. The microkernel approach improves modularity and fault isolation but historically suffered from IPC overhead. Modern microkernels like seL4 have largely overcome this penalty, and seL4 holds the distinction of being the world’s first formally verified general-purpose operating system kernel. Hybrid kernels, such as those in Windows NT and macOS (based on the XNU kernel, which combines Mach and BSD components), attempt to blend the performance of monolithic kernels with the modularity of microkernels.

The boot process illustrates how control passes from firmware to the operating system. When power is applied, the processor begins executing firmware (UEFI or legacy BIOS), which performs hardware initialization, locates a bootloader on disk, and transfers control to it. The bootloader loads the kernel image into memory and jumps to its entry point. The kernel then initializes its own data structures, detects and configures hardware devices, mounts the root file system, and launches the first user-space process — traditionally called init on Unix systems, now often replaced by systemd on Linux.

Processes, Threads, and Scheduling

A process is the operating system’s abstraction of a running program: it encapsulates a program’s code, its data, its open files, and its execution state (register values, program counter, stack pointer) in a structure called the Process Control Block (PCB). Processes are created (via system calls like fork on Unix), transition through states (new, ready, running, waiting, terminated), and are destroyed when they complete or are killed. Context switching — saving the state of one process and loading the state of another — is the mechanism by which a single processor appears to run multiple programs simultaneously.

Threads are a finer-grained unit of execution: multiple threads within a single process share the same address space and open files but have independent stacks, program counters, and register sets. Threads are cheaper to create and switch between than processes, making them the preferred unit of concurrency for most applications. The mapping between user-level threads (managed by a thread library) and kernel-level threads (managed by the OS scheduler) admits several models: one-to-one (each user thread maps to a kernel thread, as in modern Linux and Windows), many-to-one (multiple user threads map to a single kernel thread, offering cheap creation but no true parallelism), and many-to-many (a pool of kernel threads services a larger pool of user threads).

CPU scheduling determines which ready process or thread gets to run next and for how long. The simplest algorithm is First-Come, First-Served (FCFS), which runs each job to completion in arrival order — fair, but prone to the convoy effect, where short jobs wait behind long ones. Shortest Job First (SJF) minimizes average waiting time but requires knowledge of future burst lengths. Round-Robin (RR) scheduling assigns each process a fixed time quantum and cycles through the ready queue, providing good interactive responsiveness at the cost of higher context-switching overhead. Priority scheduling assigns each process a priority and always runs the highest-priority process, but risks starvation of low-priority processes unless aging (gradually increasing the priority of waiting processes) is employed.
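The trade-offs above can be made concrete with a small simulation. The sketch below (illustrative, with all jobs arriving at time 0) shows the convoy effect under FCFS and how Round-Robin lets short jobs finish early:

```python
from collections import deque

def fcfs_waits(bursts):
    """Waiting time of each job under First-Come, First-Served."""
    waits, t = [], 0
    for b in bursts:
        waits.append(t)   # a job waits for everything queued before it
        t += b
    return waits

def rr_completion(bursts, quantum):
    """Completion time of each job under Round-Robin with a fixed quantum."""
    remaining = list(bursts)
    done = [0] * len(bursts)
    queue = deque(range(len(bursts)))
    t = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])   # run for one quantum or until done
        t += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)                # back of the ready queue
        else:
            done[i] = t
    return done

bursts = [24, 3, 3]           # one long job followed by two short ones
print(fcfs_waits(bursts))     # [0, 24, 27]: short jobs stuck behind the long one
print(rr_completion(bursts, quantum=4))   # [30, 7, 10]: short jobs finish early
```

Round-Robin trades total throughput (the long job finishes later, and each switch costs a context switch) for responsiveness, which is exactly the bargain interactive systems want.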

The most sophisticated practical scheduler is the Multi-Level Feedback Queue (MLFQ), which maintains multiple queues at different priority levels and moves processes between them based on their observed behavior. CPU-bound processes that use their entire quantum are demoted to lower-priority queues; I/O-bound processes that yield the CPU quickly are promoted. This adaptive strategy approximates SJF without requiring advance knowledge of burst times. For real-time systems, where missing a deadline can have catastrophic consequences, specialized algorithms like Rate Monotonic Scheduling (RMS) and Earliest Deadline First (EDF) provide provable guarantees. EDF is optimal on a uniprocessor: it can schedule any feasible set of periodic tasks, meeting every deadline as long as total utilization does not exceed 100 percent.

Memory Management and Virtual Memory

The operating system must allocate physical memory among processes efficiently and protect each process’s memory from unauthorized access by others. Early systems used contiguous allocation, assigning each process a single block of physical memory, but this leads to fragmentation — unusable gaps between allocated blocks. Paging solves the fragmentation problem by dividing both virtual and physical memory into fixed-size units: pages (virtual) and frames (physical). Any virtual page can be mapped to any physical frame, so a process’s memory need not be contiguous in physical memory.

The mapping from virtual pages to physical frames is recorded in a page table, one per process. When the CPU generates a virtual address, the Memory Management Unit (MMU) extracts the page number, looks it up in the page table, and produces the corresponding physical frame number. To avoid the cost of accessing the page table in main memory on every instruction, the hardware maintains a Translation Lookaside Buffer (TLB) — a small, fast cache of recent translations. TLB miss rates are typically below 1 percent, but when a miss occurs, the cost of walking a multi-level page table is significant.
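The translation path can be sketched with a single-level page table and a dictionary standing in for the TLB (a simplification: real hardware uses multi-level tables and a small set-associative TLB):

```python
PAGE_SIZE = 4096   # assume 4 KiB pages

def translate(vaddr, page_table, tlb):
    """Translate a virtual address to a physical one, checking the TLB first.
    page_table maps page number -> frame number; tlb caches recent entries."""
    page, offset = divmod(vaddr, PAGE_SIZE)   # split into page number + offset
    if page in tlb:
        frame = tlb[page]          # TLB hit: no page-table access needed
    else:
        frame = page_table[page]   # TLB miss: walk the page table...
        tlb[page] = frame          # ...and cache the translation
    return frame * PAGE_SIZE + offset

page_table = {0: 5, 1: 2}
tlb = {}
print(translate(4100, page_table, tlb))   # page 1, offset 4 -> frame 2 -> 8196
```

Note that the offset passes through unchanged: paging only remaps the page number, which is why page size is always a power of two.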

Virtual memory extends paging by allowing processes to allocate more memory than the machine physically has, using disk as a backing store. When a process accesses a page that is not currently in physical memory — a page fault — the operating system loads the page from disk, possibly evicting another page to make room. Page replacement algorithms determine which page to evict. The theoretically optimal algorithm (Belady’s algorithm) evicts the page that will not be used for the longest time, but this requires future knowledge. Practical algorithms approximate this: LRU (Least Recently Used) evicts the page that has gone longest without access, while clock (second-chance) approximates LRU cheaply using a reference bit. The working set model, introduced by Peter Denning in 1968, captures the set of pages a process is actively using and provides a framework for understanding thrashing — the pathological state where the system spends more time paging than computing because processes’ combined working sets exceed physical memory.
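The clock algorithm mentioned above can be sketched in a few lines: the hand sweeps over the frames, clearing set reference bits (the "second chance") and evicting the first frame whose bit is already clear.

```python
def clock_faults(references, num_frames):
    """Count page faults under second-chance (clock) replacement."""
    frames = [None] * num_frames   # page held in each frame
    ref = [0] * num_frames         # one reference bit per frame
    hand = 0
    faults = 0
    for page in references:
        if page in frames:
            ref[frames.index(page)] = 1    # hit: set the reference bit
            continue
        faults += 1
        while ref[hand]:                   # referenced? give it a second chance
            ref[hand] = 0
            hand = (hand + 1) % num_frames
        frames[hand] = page                # evict and install the new page
        ref[hand] = 1
        hand = (hand + 1) % num_frames
    return faults

print(clock_faults([1, 2, 3, 1, 4, 5], num_frames=3))   # 5
```

With one bit of state per frame this approximates LRU closely enough that variants of it run in most production kernels.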

Segmentation offers an alternative (or complement) to paging by dividing a process’s address space into logical segments — code, data, stack, heap — each with its own base address and length. Segments provide a natural unit for protection (a code segment can be marked read-only and executable) and sharing (two processes can share a library segment). The x86 architecture historically combined segmentation and paging; modern 64-bit operating systems use a flat memory model with paging alone.

File Systems and Storage

The file system provides the persistent storage abstraction: it organizes data into named files and hierarchical directories, manages the allocation of disk blocks, and ensures data integrity in the face of crashes. File systems must bridge the gap between the logical view (a tree of named files with arbitrary contents) and the physical reality (a disk or SSD with fixed-size blocks, slow random access, and a finite lifetime).

The fundamental design question is how to map a file’s logical blocks to physical disk blocks. Contiguous allocation stores each file as a sequence of adjacent blocks — simple and fast for sequential access, but prone to external fragmentation. Linked allocation chains blocks together with pointers, eliminating fragmentation but making random access expensive. Indexed allocation, used by the Unix file system (UFS) and its descendants, stores pointers to all of a file’s blocks in an inode (index node). The classic Unix inode contains direct pointers (for small files), single-indirect pointers (pointing to a block of pointers), double-indirect, and triple-indirect pointers, supporting files up to terabytes in size with efficient access to both small and large files.
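The "terabytes" claim for the classic inode layout is easy to check arithmetically. The sketch below uses illustrative parameters (4 KiB blocks, 4-byte block pointers, 12 direct pointers), not the exact values of any one file system:

```python
def max_file_size(block_size=4096, ptr_size=4, n_direct=12):
    """Maximum file size reachable through a classic Unix inode with direct,
    single-, double-, and triple-indirect block pointers."""
    ptrs_per_block = block_size // ptr_size      # pointers held by one indirect block
    blocks = (n_direct                           # direct pointers
              + ptrs_per_block                   # single indirect
              + ptrs_per_block ** 2              # double indirect
              + ptrs_per_block ** 3)             # triple indirect
    return blocks * block_size

print(max_file_size() / 2**40)   # just over 4 TiB with these parameters
```

The triple-indirect term dominates (1024³ blocks, i.e. 4 TiB here), while the direct pointers keep small files reachable with zero extra disk reads — the design serves both ends of the size distribution.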

Free space management tracks which disk blocks are available. Bitmaps — one bit per block — are compact and allow fast allocation of contiguous runs; free lists are simpler but less efficient for finding contiguous space. Journaling (also called write-ahead logging), introduced in file systems like ext3, NTFS, and XFS, addresses the crash-consistency problem: by writing a description of each file system operation to a journal before performing it, the system can recover from crashes by replaying or undoing incomplete operations, avoiding the need for a full file-system check.
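A sketch of why bitmaps make contiguous allocation cheap: scanning for a run of clear bits is a single linear pass. (This is an illustration, not any real file system's allocator.)

```python
def find_free_run(bitmap, k):
    """Return the first block number starting a run of k consecutive free
    blocks (0 bits) in the bitmap, or -1 if no such run exists."""
    run_start, run_len = 0, 0
    for i, bit in enumerate(bitmap):
        if bit == 0:
            if run_len == 0:
                run_start = i      # a new run begins here
            run_len += 1
            if run_len == k:
                return run_start
        else:
            run_len = 0            # an allocated block breaks the run
    return -1

bitmap = [1, 1, 0, 1, 0, 0, 0, 1]   # 1 = allocated, 0 = free
print(find_free_run(bitmap, 3))     # 4: blocks 4-6 are the first free run of 3
```

A free list would have to chase pointers and sort or coalesce entries to answer the same question, which is the inefficiency the text alludes to.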

Modern file systems go further. Copy-on-write (COW) file systems like ZFS and Btrfs never overwrite data in place; instead, they write new data to a new location and atomically update the metadata pointer. This provides free snapshots, simplified crash recovery, and built-in checksumming for data integrity. Log-structured file systems (LFS), proposed by Mendel Rosenblum and John Ousterhout in 1991, treat the entire disk as a sequential log; the design originally targeted write-heavy workloads on spinning disks and later proved a natural fit for solid-state drives, which also favor sequential writes. Flash-aware file systems like F2FS are designed explicitly for the NAND flash storage that dominates modern devices.

Concurrency and Synchronization

When multiple threads or processes share resources, the operating system must provide mechanisms to coordinate their access and prevent race conditions — situations where the outcome depends on the unpredictable timing of interleaved operations. The fundamental abstraction is mutual exclusion: ensuring that only one thread at a time executes a critical section of code that accesses shared data.

Locks (or mutexes) are the simplest mutual exclusion primitive: a thread acquires the lock before entering the critical section and releases it upon exit. In hardware, locks are built using atomic instructions like test-and-set or compare-and-swap (CAS), which read and modify a memory location in a single indivisible operation. A spinlock busy-waits (repeatedly checking the lock) and is appropriate when the expected wait is very short; a blocking lock puts the waiting thread to sleep and wakes it when the lock becomes available, saving CPU cycles for longer waits.
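A spinlock can be sketched in Python by letting a non-blocking `acquire` stand in for the atomic test-and-set instruction (an emulation: real spinlocks use the hardware primitive directly):

```python
import threading

class SpinLock:
    """Spinlock sketch: a non-blocking acquire on an inner lock emulates
    an atomic test-and-set; the caller busy-waits until it succeeds."""
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            pass                     # spin: keep retrying the "test-and-set"

    def release(self):
        self._flag.release()

lock = SpinLock()
count = 0

def worker():
    global count
    for _ in range(10_000):
        lock.acquire()               # enter the critical section
        count += 1                   # read-modify-write on shared state
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count)   # 40000: the lock serializes every increment
```

Without the lock, the four interleaved read-modify-write sequences could lose updates — exactly the race condition the critical-section abstraction exists to prevent.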

Semaphores, introduced by Edsger Dijkstra in 1965, generalize locks to control access to a pool of resources. A semaphore maintains a non-negative integer counter: the wait (P) operation decrements the counter and blocks if it would go negative; the signal (V) operation increments the counter and wakes a blocked thread. A binary semaphore (with initial value 1) behaves like a mutex; a counting semaphore (with initial value n) allows up to n threads to access a resource simultaneously. Monitors, proposed by C.A.R. Hoare and Per Brinch Hansen in the early 1970s, combine mutual exclusion with condition variables — a mechanism that allows a thread to wait for a particular condition to become true while atomically releasing the monitor’s lock.
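Counting semaphores solve the bounded-buffer (producer-consumer) problem directly, as in this sketch: one semaphore counts empty slots, one counts full slots, and a mutex protects the buffer itself.

```python
import threading
from collections import deque

CAPACITY = 4
buffer = deque()
mutex = threading.Lock()
empty = threading.Semaphore(CAPACITY)   # free slots: producers block at 0
full = threading.Semaphore(0)           # filled slots: consumers block at 0
consumed = []

def producer(items):
    for item in items:
        empty.acquire()          # P(empty): claim a free slot (may block)
        with mutex:
            buffer.append(item)
        full.release()           # V(full): announce a filled slot

def consumer(n):
    for _ in range(n):
        full.acquire()           # P(full): wait for an item (may block)
        with mutex:
            consumed.append(buffer.popleft())
        empty.release()          # V(empty): hand the slot back

t1 = threading.Thread(target=producer, args=(range(10),))
t2 = threading.Thread(target=consumer, args=(10,))
t1.start(); t2.start(); t1.join(); t2.join()
print(consumed)   # [0, 1, ..., 9]: a single producer preserves FIFO order
```

Note the ordering discipline: each thread waits on the counting semaphore before taking the mutex. Acquiring them in the opposite order would let a blocked producer hold the mutex and deadlock the consumer.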

Deadlock occurs when a set of threads are all waiting for resources held by other threads in the set, forming a circular dependency. The four necessary conditions for deadlock, identified by Edward Coffman and colleagues in 1971, are mutual exclusion, hold and wait, no preemption, and circular wait. Operating systems address deadlock through prevention (designing the system to violate one of the four conditions), avoidance (using algorithms like the Banker’s algorithm, due to Dijkstra, which checks whether granting a resource request would lead to an unsafe state), detection (periodically searching for cycles in the resource-allocation graph), or — most commonly in practice — simply ignoring the problem and relying on developers to write deadlock-free code.
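Detection reduces to cycle-finding, as this sketch over a wait-for graph shows (an edge a → b means thread a is waiting for a resource held by thread b; a cycle is exactly the circular-wait condition):

```python
def has_deadlock(wait_for):
    """Detect deadlock as a cycle in the wait-for graph via depth-first
    search with three-color marking."""
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited / on current path / done
    color = {t: WHITE for t in wait_for}

    def visit(t):
        color[t] = GRAY
        for u in wait_for.get(t, []):
            if color.get(u, WHITE) == GRAY:       # back edge: circular wait
                return True
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in wait_for)

# T1 waits on T2, T2 on T3, T3 on T1: a cycle, hence deadlock.
print(has_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))   # True
print(has_deadlock({"T1": ["T2"], "T2": []}))                     # False
```

Real detectors must also handle resources with multiple instances (where a cycle is necessary but not sufficient), which is where graph search gives way to Banker's-style accounting.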

The classic synchronization problems — the dining philosophers, the producer-consumer (bounded buffer), and the readers-writers problem — serve as canonical illustrations of concurrency challenges and as testbeds for synchronization primitives. Each captures a different pattern of contention and cooperation that recurs throughout operating system and application design.

Virtualization and Containers

Virtualization allows a single physical machine to host multiple isolated virtual machines (VMs), each running its own operating system. The concept dates to the IBM CP/CMS system of the late 1960s, but it was the introduction of hardware virtualization support in x86 processors (Intel VT-x in 2005, AMD-V in 2006) that made virtualization practical and ubiquitous. A hypervisor (or virtual machine monitor) manages the virtual machines. Type-1 (bare-metal) hypervisors, such as VMware ESXi and Xen, run directly on the hardware; Type-2 (hosted) hypervisors, such as VirtualBox, run atop a conventional operating system.

Memory virtualization requires mapping a guest’s virtual addresses to the host’s physical addresses — a two-level translation. Shadow page tables maintain a direct virtual-to-physical mapping for the guest, but they are expensive to maintain. Nested page tables (Intel’s Extended Page Tables, or EPT; AMD’s Nested Page Tables, or NPT) provide hardware support for the two-level walk, significantly reducing overhead. Memory ballooning allows the hypervisor to reclaim physical memory from guests by inflating a balloon driver inside the guest OS, which forces the guest to page out its own data.

Containers offer a lighter-weight alternative to full virtualization. Rather than running a complete guest OS, containers share the host kernel and achieve isolation through OS-level mechanisms. On Linux, namespaces isolate process IDs, network stacks, mount points, and user IDs, giving each container the illusion of being the only tenant. Control groups (cgroups) limit and account for each container’s CPU, memory, and I/O usage. Union file systems (such as OverlayFS) allow containers to share read-only base layers while maintaining private writable layers, dramatically reducing storage overhead. Container orchestration platforms like Kubernetes manage the deployment, scaling, and networking of containerized applications across clusters of machines, forming the backbone of modern cloud infrastructure.

Security and Protection

Operating system security encompasses the mechanisms that protect users, processes, and data from unauthorized access and malicious behavior. The foundation is the protection ring model, where code at different privilege levels (kernel, drivers, user) has different permissions. Access control determines which subjects (users, processes) can perform which operations on which objects (files, devices, memory regions).

Discretionary Access Control (DAC), the default model in Unix and Windows, allows resource owners to set permissions. The Unix permission model assigns read, write, and execute bits to the owner, group, and others for each file. Mandatory Access Control (MAC), used in high-security systems, enforces system-wide policies that even resource owners cannot override. The Bell-LaPadula model (1973) formalizes confidentiality-oriented MAC with the “no read up, no write down” rules, while the Biba model addresses integrity with complementary rules. Linux implementations like SELinux (developed by the NSA) and AppArmor provide practical MAC frameworks.

Modern operating systems deploy multiple layers of exploit mitigation. Address Space Layout Randomization (ASLR) randomizes the memory layout of a process, making it difficult for an attacker to predict the location of code or data. Stack canaries place sentinel values before return addresses on the stack; if a buffer overflow overwrites the canary, the system detects the corruption before the attacker can hijack control flow. Data Execution Prevention (DEP), also called the NX (No-Execute) bit, marks data regions as non-executable, preventing the classic technique of injecting shellcode into a buffer and executing it. Control Flow Integrity (CFI) restricts the set of targets for indirect jumps and calls, thwarting return-oriented programming (ROP) attacks that chain together existing code fragments.

System call filtering mechanisms like seccomp on Linux allow processes to restrict the set of system calls they can make, limiting the damage an exploited process can cause. Capability-based security replaces ambient authority (where a process inherits all the privileges of its user) with explicit tokens granting specific permissions, following the principle of least privilege. Together, these mechanisms form a defense-in-depth strategy, recognizing that no single layer of protection is sufficient against determined attackers.