Distributed Systems

Coordination, consistency, and fault tolerance across machines.


foundation tier

Distributed Systems addresses coordination, consistency, and fault tolerance across machines. It sits within Systems and inherits that area’s core questions about correctness, scale, and tractability. This page surveys the conceptual axes of the topic and points to the references that frame ongoing research and teaching. The intent is to be useful both as an entry point for newcomers and as an index for practitioners cross-checking their mental model against the field’s primary sources.

Work on distributed systems can be organised around a few interlocking concerns: the formal models under study (failure and timing assumptions, consistency specifications), the algorithms and systems built on those models, the resource trade-offs (latency, memory, communication, and, for distributed training, statistical efficiency), and the empirical or theoretical guarantees that practitioners rely on. The sources cited below approach the topic from a mix of these angles.
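
One small, concrete instance of trading communication for guarantees is quorum sizing in replicated storage. With n replicas, a read that waits for r replies and a write that waits for w acknowledgements are guaranteed to share at least one replica exactly when r + w > n, and each operation can still complete with n - r (or n - w) replicas crashed. The sketch below assumes crash failures only; the function names are hypothetical, and overlap alone does not guarantee linearizability.

```python
from itertools import combinations


def quorum_properties(n: int, r: int, w: int) -> dict:
    """Summarise the classic quorum conditions for n replicas,
    read quorum size r, and write quorum size w (hypothetical helper)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("quorum sizes must lie in [1, n]")
    return {
        # Every read quorum overlaps every write quorum, so a read contacts at
        # least one replica that acknowledged the latest completed write.
        "read_write_overlap": r + w > n,
        # Any two write quorums overlap, so two writes cannot both complete
        # on disjoint sets of replicas.
        "write_write_overlap": 2 * w > n,
        # Crashed replicas tolerated while the operation can still complete.
        "read_fault_tolerance": n - r,
        "write_fault_tolerance": n - w,
    }


def quorums_always_intersect(n: int, r: int, w: int) -> bool:
    """Brute-force check: does every r-subset of replicas meet every w-subset?"""
    replicas = range(n)
    return all(
        set(rq) & set(wq)
        for rq in combinations(replicas, r)
        for wq in combinations(replicas, w)
    )


# The closed-form condition r + w > n agrees with exhaustive enumeration.
for n in range(1, 6):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            closed_form = quorum_properties(n, r, w)["read_write_overlap"]
            assert quorums_always_intersect(n, r, w) == closed_form

# A common configuration: three replicas, majority reads and writes.
print(quorum_properties(3, 2, 2))
```

Raising r or w buys stronger overlap at the cost of more messages per operation and lower availability; lowering them does the reverse.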

Foundational references

Kleppmann, Designing Data-Intensive Applications (2017), and Tanenbaum and van Steen, Distributed Systems: Principles and Paradigms (2017), are standard references for this material and serve both as curriculum anchors and as long-form surveys of techniques. Kleppmann approaches the topic from the data-systems side (replication, partitioning, transactions, and consensus in storage systems), while Tanenbaum and van Steen cover general principles such as communication, naming, coordination, replication, and fault tolerance.

Historical context

The Part-Time Parliament (Lamport, 1998), which introduced the Paxos consensus algorithm, is the key historical reference here; revisiting it clarifies which ideas in current practice are recent and which trace back to the field’s founding texts.

Open methodological questions in distributed systems cluster around how to compose the techniques covered in these references under realistic constraints: scale, adversarial inputs, partial observability, and shifting workloads. The cited references give the precise statements, proofs, and empirical evaluations that this overview only sketches; downstream topic pages drill into specific subfields.

Sources

  • textbook · primary · 2017
    Designing Data-Intensive Applications
    kleppmann-2017
  • textbook · primary · 2017
    Distributed Systems: Principles and Paradigms
    tanenbaum-2017
  • paper · historical · 1998
    The Part-Time Parliament
    lamport-1998

Explore

  1. Consensus Protocols
    Paxos, Raft, and Viewstamped Replication.

  2. Consistency Models
    Linearizability, serializability, and weak consistency.

  3. Replication
    Primary-backup, chain, and quorum-based replication.

  4. Distributed Transactions
    Two-phase commit (2PC), three-phase commit (3PC), and modern distributed transaction protocols.

  5. Distributed Storage
    Object stores, distributed file systems, and erasure coding.

  6. Distributed Coordination
    ZooKeeper, etcd, and coordination services.

  7. Failure Detection
    Phi-accrual and SWIM-style failure detectors.

  8. Conflict-Free Replicated Data Types
    Commutative and convergent replicated data types.

  9. Distributed ML Systems
    Parameter servers, all-reduce, and training-system design.

  10. Serverless Computing
    Function-as-a-service runtimes and resource elasticity.

  11. Edge Computing
    Placing compute close to data sources at the network edge.

  12. Blockchain Systems
    Bitcoin/Ethereum-style decentralized ledgers and consensus.

  13. Distributed Tracing
    Causal tracing and end-to-end observability in microservices.
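
As a small illustration of the conflict-free replicated data type entry above, the following sketches a state-based grow-only counter (G-Counter), one of the standard convergent replicated data types. The class and method names are illustrative and not taken from any particular library; the sketch assumes each replica has a unique string identifier and that replicas periodically exchange their full state.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """State-based grow-only counter: one non-decreasing slot per replica."""

    replica_id: str
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # Each replica only ever advances its own slot.
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The counter's value is the sum over all replicas' slots.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent, so
        # replicas converge regardless of delivery order or duplicated merges.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


# Two replicas update independently, then exchange state in both directions.
a, b = GCounter("a"), GCounter("b")
a.increment()
a.increment()
b.increment(3)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5
```

Because merge takes a per-replica maximum rather than adding counts, re-delivering the same state twice cannot inflate the total; supporting decrements would need a different design, such as a pair of G-Counters.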


Review this topic

This page was drafted by an agent and is waiting on expert review. Spotted a wrong prerequisite, a missing concept, a misattributed source, or a factual slip? Tell us — your review opens a tracked issue maintainers act on.